Preface



This volume contains the revised versions of selected papers presented during the
31st Annual Conference of the German Classification Society (Gesellschaft für
Klassifikation - GfKl). The conference was held at the Albert-Ludwigs-University in
Freiburg, Germany, in March 2007. The focus of the conference was on Data Analysis,
Machine Learning, and Applications; it comprised 200 talks in 36 sessions.
Additionally, 11 plenary and semi-plenary talks were held by outstanding researchers.
With 292 participants from 19 countries in Europe and overseas, this GfKl Conference
once again provided an international forum for discussions and mutual exchange
of knowledge with colleagues from different fields of interest. Of altogether
120 full papers that had been submitted for this volume, 82 were finally accepted.

On the occasion of the 30th anniversary of the German Classification Society,
the associated societies Sekcja Klasyfikacji i Analizy Danych PTS (SKAD), Vereniging
voor Ordinatie en Classificatie (VOC), Japanese Classification Society (JCS) and
Classification and Data Analysis Group (CLADAG) have sponsored the following invited
talks: Paul Eilers - Statistical Classification for Reliable High-volume Genetic
Measurements (VOC); Eugeniusz Gatnar - Fusion of Multiple Statistical Classifiers
(SKAD); Akinori Okada - Two-Dimensional Centrality of a Social Network (JCS);
Donatella Vicari - Unsupervised Multivariate Prediction Including Dimensionality
Reduction (CLADAG).

The scientific program included a broad range of topics. Besides the main theme
of the conference, methods and applications of data analysis and machine
learning were given special consideration. The following sessions were established:

I. Theory and Methods 

Supervised Classification, Discrimination, and Pattern Recognition (G. Ritter); Clus- 
ter Analysis and Similarity Structures (H.-H. Bock and J. Buhmann); Classifica- 
tion and Regression (C. Bailer-Jones and C. Hennig); Frequent Pattern Mining (C. 
Borgelt); Data Visualization and Scaling Methods (P. Groenen, T. Imaizumi, and A. 
Okada); Exploratory Data Analysis and Data Mining (M. Meyer and M. Schwaiger); 
Mixture Analysis in Clustering (S. Ingrassia, D. Karlis, P. Schlattmann and W. Seidel);
Knowledge Representation and Knowledge Discovery (A. Ultsch); Statistical
Relational Learning (H. Blockeel and K. Kersting); Online Algorithms and Data 
Streams (C. Sohler); Analysis of Time Series, Longitudinal and Panel Data (S. Lang); 
Tools for Intelligent Data Analysis (M. Hahsler and K. Hornik); Data Preprocessing 
and Information Extraction (H.-J. Lenz); Typing for Modeling (W. Esswein). 

II. Applications 

Marketing and Management Science (D. Baier, Y. Boztug, and W. Steiner); Banking
and Finance (K. Jajuga and H. Locarek-Junge); Business Intelligence and Personalization
(A. Geyer-Schulz and L. Schmidt-Thieme); Data Analysis in Retailing (T.
Reutterer); Econometrics and Operations Research (W. Polasek); Image and Signal
Analysis (H. Burkhardt); Biostatistics and Bioinformatics (R. Backofen, H.-P.
Klenk and B. Lausen); Medical and Health Sciences (K.-D. Wernecke); Text Mining,
Web Mining, and the Semantic Web (A. Nürnberger and M. Spiliopoulou); Statistical
Natural Language Processing (P. Cimiano); Linguistics (H. Goebl and P. Grzybek);
Subject Indexing and Library Science (H.-J. Hermes and B. Lorenz); Statistical
Musicology (C. Weihs); Archaeology and Archaeometry (M. Helfert and I. Herzog);
Psychology (S. Krolak-Schwerdt); Data Analysis in Higher Education (A. Hilbert).

Contributed Sessions (by CLADAG and SKAD) 

Latent class models for classification (A. Montanari and A. Cerioli); Classification 
and models for interval-valued data (E. Palumbo); Selected Problems in Classifica- 
tion (E. Gatnar); Recent Developments in Multidimensional Data Analysis between 
research and practice I (L. D’Ambra); Recent Developments in Multidimensional 
Data Analysis between research and practice II (B. Simonetti). 

The editors would like to emphatically thank all the section chairs for doing 
such a great job regarding the organization of their sections and the associated paper 
reviews. 

Cordial thanks also go to the members of the scientific program committee for
their conceptual and practical support as well as for the paper reviews: D. Baier
(Cottbus), H.-H. Bock (Aachen), H. Bozdogan (Tennessee), J. Buhmann (Zurich),
H. Burkhardt (Freiburg), A. Cerioli (Parma), R. Decker (Bielefeld), W. Gaul (Karlsruhe),
A. Geyer-Schulz (Karlsruhe), P. Groenen (Rotterdam), T. Imaizumi (Tokyo),
K. Jajuga (Wroclaw), R. Kruse (Magdeburg), S. Lang (Innsbruck), B. Lausen
(Erlangen-Nürnberg), H.-J. Lenz (Berlin), F. Murtagh (London), H. Ney (Aachen), A.
Okada (Tokyo), L. Schmidt-Thieme (Hildesheim), C. Schnoerr (Mannheim), M.
Spiliopoulou (Magdeburg), C. Weihs (Dortmund), D. A. Zighed (Lyon).

Furthermore we would like to thank the additional reviewers: A. Hotho, L. Mar- 
inho, C. Preisach, S. Rendle, S. Scholz, K. Tso. 

The great success of this conference would not have been possible without the
support of many people working mainly behind the scenes. We would particularly
like to thank M. Temerinac (Freiburg), J. Fehr (Freiburg), C. Findlay (Freiburg),
E. Patschke (Freiburg), A. Busche (Hildesheim), K. Tso (Hildesheim), L. Marinho
(Hildesheim) and the student support team for their hard work in the preparation
of this conference, for the support during the event and the post-processing of the
conference.

The GfKl Conference 2007 would not have been possible in the way it took place
without the financial and/or material support of the following institutions and
companies (in alphabetical order): Albert-Ludwigs-University Freiburg - Faculty of
Applied Sciences, Gesellschaft für Klassifikation e.V., Microsoft München and Springer
Verlag. We express our gratitude to all of them. Finally, we would like to thank Dr.
Martina Bihn from Springer Verlag, Heidelberg, for her support and dedication to
the production of this volume.



Hildesheim, Freiburg and Bielefeld, February 2008 Christine Preisach 

Hans Burkhardt 
Lars Schmidt-Thieme 
Reinhold Decker 
Lluís Belanche¹, Juan Luis Vázquez² and Miguel Vázquez³

¹ Dept. de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
08034 Barcelona, Spain
belanche@lsi.upc.edu
² Departamento de Matemáticas
juanluis.vazquez@uam.es
³ Dept. Sistemas Informáticos y Programación
28040 Madrid, Spain
mivazque@fdi.ucm.es

Abstract. We consider distance-based similarity measures for real-valued vectors of interest
in kernel-based machine learning algorithms. In particular, we study a truncated Euclidean
similarity measure and a self-normalized similarity measure related to the Canberra distance.
It is proved that they are positive semi-definite (p.s.d.), thus facilitating their use in
kernel-based methods like the Support Vector Machine, a very popular machine learning tool.
These kernels may be better suited than standard kernels (like the RBF) in certain situations
that are described in the paper. Some rather general results concerning positivity properties
are presented in detail, as well as some interesting ways of proving the p.s.d. property.



1 Introduction 

One of the latest machine learning methods to be introduced is the Support Vector 
Machine (SVM). It has become very widespread due to its firm grounds in statistical 
learning theory (Vapnik (1998)) and its generally good practical results. Central to 
SVMs is the notion of kernel function, a mapping of variables from its original space 
to a higher-dimensional Hilbert space in which the problem is expected to be easier. 
Intuitively, the kernel represents the similarity between two data observations. In the 
SVM literature there are basically two common-place kernels for real vectors, one 
of which (popularly known as the RBF kernel) is based on the Euclidean distance 
between the two collections of values for the variables (seen as vectors). 

Obviously not all two-place functions can act as kernel functions. The conditions
for being a kernel function are very precise and related to the so-called kernel matrix
being positive semi-definite (p.s.d.). The question remains, how should the similarity
between two vectors of (positive) real numbers be computed? Which of these simi- 
larity measures are valid kernels? There are many interesting possibilities that come 
from well-established distances that may share the property of being p.s.d. There has 
been little work on this subject, probably due to the widespread use of the initially 
proposed kernel and the difficulty of proving the p.s.d. property to obtain additional 
kernels. 

In this paper we tackle this matter by examining two alternative distance-based 
similarity measures on vectors of real numbers and show the corresponding kernel 
matrices to be p.s.d. These two distance-based kernels could better fit some applica- 
tions than the normal Euclidean distance and derived kernels (like the RBF kernel). 
The first one is a truncated version of the standard Euclidean metric in R, which 
additionally extends some of Gower’s work in Gower (1971). This similarity yields 
more sparse matrices than the standard metric. The second one is inversely related 
to the Canberra distance, well-known in data analysis (Chandon and Pinson (1971)). 
The motivation for using this similarity instead of the traditional Euclidean-based 
distance is twofold: (a) it is self-normalised, and (b) it scales in a log fashion, so that 
similarity is smaller if the numbers are small than if the numbers are big. 

The paper is organized as follows. In Section 2 we review work in kernels and 
similarities defined on real numbers. The intuitive semantics of the two new kernels 
is discussed in Section 3. As main results, we intend to show some interesting ways 
of proving the p.s.d. property. We present them in full in Sections 4 and 5 in the 
hope that they may be found useful by anyone dealing with the difficult task of 
proving this property. In Section 6 we establish results for positive vectors which 
lead to kernels created as a combination of different one-dimensional distance-based 
kernels, thereby extending the RBF kernel. 



2 Kernels and similarities defined on real numbers 

We consider kernels that are similarities in the classical sense: strongly reflexive, 
symmetric, non-negative and bounded (Chandon and Pinson (1971)). More specifi- 
cally, kernels k for positive vectors of the general form: 



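$$k(x,y) = f\Big(\sum_{j=1}^{n} g_j\big(d_j(x_j,y_j)\big)\Big) \tag{1}$$

where $x_j, y_j$ belong to some subset of $\mathbb{R}$, $\{d_j\}_{j=1}^{n}$ are metric distances and
$\{f, g_j\}_{j=1}^{n}$ are appropriate continuous and monotonic functions in $\mathbb{R}^+ \cup \{0\}$
making the resulting k a valid p.s.d. kernel. In order to behave as a similarity, a natural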
choice for the kernels k is to be distance-based. Almost invariably, the choice for 
distance-based real number comparison is based on the standard metric in $\mathbb{R}$. The
aggregation of a number n of such distance comparisons with the usual 2-norm
leads to Euclidean distance in $\mathbb{R}^n$. It is known that there exist inverse transformations
of this quantity (that can thus be seen as similarity measures) that are valid kernels.
An example of this is the kernel:

$$k(x,y) = \exp\Big\{-\frac{\|x-y\|^2}{2\sigma^2}\Big\}, \quad x,y \in \mathbb{R}^n,\ \sigma \neq 0 \in \mathbb{R}, \tag{2}$$

popularly known as the RBF (or Gaussian) kernel. This particular kernel is obtained
by taking $d(x_j,y_j) = |x_j - y_j|$, $g_j(z) = z^2/(2\sigma_j^2)$ for non-zero $\sigma_j^2$ and $f(z) = e^{-z}$.
Note that nothing prevents the use of different scaling parameters $\sigma_j$ for every
component. The decomposition need not be unique and is not necessarily the most
useful for proving the p.s.d. property of the kernel.

In this work we concentrate on upper-bounded metric distances, in which case 
the partial kernels gj{dj{xj,yj)) are lower-bounded, though this is not a necessary 
condition in general. We list some choices for partial distances: 

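$$d_{TrE}(x_i,y_i) = \min\{1, |x_i - y_i|\} \quad \text{(Truncated Euclidean)} \tag{3}$$

$$d_{Can}(x_i,y_i) = \frac{|x_i - y_i|}{x_i + y_i} \quad \text{(Canberra)} \tag{4}$$

$$d(x_i,y_i) = \frac{|x_i - y_i|}{\max(x_i,y_i)} \quad \text{(Maximum)} \tag{5}$$

$$d(x_i,y_i) = \frac{(x_i - y_i)^2}{x_i + y_i} \quad \text{(squared } \chi^2\text{)} \tag{6}$$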



Note the first choice is valid in $\mathbb{R}$, while the others are valid in $\mathbb{R}^+$. There is some
related work worth mentioning, since other choices have been considered elsewhere:
with the choice $g_j(z) = 1 - z$, a kernel formed as in (1) for the distance (5) appears
as p.s.d. in Shawe-Taylor and Cristianini (2004). Also with this choice for $g_j$, and
taking $f(z) = e^{z/\sigma}$, $\sigma > 0$, the distance (6) leads to a kernel that has been proved
p.s.d. in Fowlkes et al. (2004).



3 Semantics and applicability 

The distance in (3) is a truncated version of the standard metric in $\mathbb{R}$, which can
be useful when differences greater than a specified threshold have to be ignored.
In similarity terms, it models situations wherein data examples can become more
and more similar until they are suddenly indistinguishable. Otherwise, it behaves
like the standard metric in $\mathbb{R}$. Notice that this similarity may lead to more sparse
matrices than those obtainable with the standard metric. The distance in (4) is called
the Canberra distance (for one component). It is self-normalised to the real interval
[0,1), and is multiplicative rather than additive, being especially sensitive to small
changes near zero. Its behaviour can be best seen by a simple example: let a variable
stand for the number of children; then the distance between 7 and 9 is not the same
“psychological” distance as that between 1 and 3 (which is triple); however,
|7 − 9| = |1 − 3|. If we would like the distance between 1 and 3 to be much greater than
that between 7 and 9, then this effect is captured. More specifically, letting $z = x/y$, then
$d_{Can}(x,y) = g(z)$, where $g(z) = |z - 1|/(z + 1)$ and thus $g(z) = g(1/z)$. The Canberra
distance has been used with great success in content-based image retrieval tasks in
Kokare et al. (2003).
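
The number-of-children example can be checked with a few lines of code. The following is a minimal sketch, not part of the original paper; the function name `canberra_distance` is chosen here for illustration only.

```python
def canberra_distance(x: float, y: float) -> float:
    """One-component Canberra distance |x - y| / (x + y) for positive x, y."""
    if x <= 0 or y <= 0:
        raise ValueError("this sketch only handles positive inputs")
    return abs(x - y) / (x + y)


# Same absolute difference |x - y| = 2, but different relative difference.
d_small = canberra_distance(1, 3)    # 2 / 4
d_large = canberra_distance(7, 9)    # 2 / 16

# The distance depends only on the quotient z = x / y, and g(z) = g(1/z).
d_scaled = canberra_distance(10, 30)
d_swapped = canberra_distance(3, 1)
```

Here the pair (1, 3) is four times as distant as the pair (7, 9), although both differ by 2 in absolute terms.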



4 Truncated Euclidean similarity 

Let $\{x_i\}$ be an arbitrary finite collection of n different real points $x_i \in \mathbb{R}$, $i = 1, \dots, n$.
We are interested in the $n \times n$ similarity matrix $A = (a_{ij})$ with

$$a_{ij} = 1 - d_{ij}, \qquad d_{ij} = \min\{1, |x_i - x_j|\}, \tag{7}$$

where the usual Euclidean distances have been replaced by truncated Euclidean distances.
We can also write $a_{ij} = (1 - d_{ij})_+ = \max\{0, 1 - |x_i - x_j|\}$.

Theorem 1. The matrix A is positive semi-definite (p.s.d.).

Proof. We define the bounded functions $X_i(x)$ for $x \in \mathbb{R}$ with value 1 if $|x - x_i| \le 1/2$,
zero otherwise. We calculate the interaction integrals

$$I_{ij} = \int_{\mathbb{R}} X_i(x) X_j(x)\,dx .$$

The value is the length of the interval $[x_i - 1/2, x_i + 1/2] \cap [x_j - 1/2, x_j + 1/2]$. It is
easy to see that $I_{ij} = 1 - d_{ij}$ if $d_{ij} < 1$, and zero if $|x_i - x_j| \ge 1$ (i.e., when there is no
overlapping of supports). Therefore, $I_{ij} = a_{ij}$ if $i \neq j$. Moreover, for $i = j$ we have

$$\int_{\mathbb{R}} X_i(x) X_j(x)\,dx = \int_{\mathbb{R}} X_i^2(x)\,dx = 1 .$$

We conclude that the matrix A is obtained as the interaction matrix for the system of
functions $\{X_i\}_{i=1}^{n}$. These interactions are actually the dot products of the functions in
the functional space $L^2(\mathbb{R})$. Since $a_{ij}$ is the dot product of the inputs cast into some
Hilbert space it forms, by definition, a p.s.d. matrix.

Notice that rescaling of the inputs would allow us to substitute the two “1” (one) in
equation (7) by any arbitrary positive number. In other words, the kernel with matrix

$$a_{ij} = (s - d_{ij})_+ = \max\{0, s - |x_i - x_j|\} \tag{8}$$

with $s > 0$ is p.s.d. The classical result for general Euclidean similarity in Gower
(1971) is a consequence of this Theorem when $|x_i - x_j| \le 1$ for all $i, j$.
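
As a numerical illustration of Theorem 1, the sketch below builds the similarity matrix of equation (8) for a few random points and evaluates the quadratic form $c^T A c$ for many random vectors c. This is only a spot check under assumed names (`truncated_similarity`, `quadratic_form`) and arbitrary sample sizes; it does not replace the proof.

```python
import random


def truncated_similarity(x: float, y: float, s: float = 1.0) -> float:
    """Truncated Euclidean similarity max{0, s - |x - y|}, as in (8)."""
    return max(0.0, s - abs(x - y))


def similarity_matrix(points, s=1.0):
    """Matrix with entries a_ij = max{0, s - |x_i - x_j|}."""
    return [[truncated_similarity(p, q, s) for q in points] for p in points]


def quadratic_form(matrix, c):
    """c^T A c for a square matrix given as nested lists."""
    n = len(c)
    return sum(c[i] * c[j] * matrix[i][j] for i in range(n) for j in range(n))


rng = random.Random(0)
points = [rng.uniform(-3.0, 3.0) for _ in range(8)]
A = similarity_matrix(points)

# A p.s.d. matrix never yields a negative quadratic form.
min_form = min(
    quadratic_form(A, [rng.gauss(0.0, 1.0) for _ in points])
    for _ in range(2000)
)

# Entries for points further apart than s are exactly zero (sparsity).
zero_entries = sum(1 for row in A for a in row if a == 0.0)
```

The count in `zero_entries` illustrates the sparsity mentioned in Section 3: every pair of points more than one unit apart contributes a zero entry.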
5 Canberra distance-based similarity

We define the Canberra similarity between two points as follows

$$s_{Can}(x_i,x_j) = 1 - d_{Can}(x_i,x_j), \qquad d_{Can}(x_i,x_j) = \frac{|x_i - x_j|}{x_i + x_j}, \tag{9}$$

where $d_{Can}(x_i,x_j)$ is called the Canberra distance, as in (4). We establish next
the p.s.d. property for Canberra distance matrices, for $x_i, x_j \in \mathbb{R}^+$.

Theorem 2. The matrix $A = (a_{ij})$ with $a_{ij} = s_{Can}(x_i,x_j)$ is p.s.d.

Proof. First step. Examination of equation (9) easily shows that for any $x_i, x_j \in \mathbb{R}^+$
(not including 0) the value of $s_{Can}(x_i,x_j)$ is the same for every pair of points $x_i, x_j$
that have the same quotient $x_i/x_j$. This gives us the idea of taking logarithms on the
input and finding an equivalent kernel for the translated inputs. From now on, define
$x = x_i$, $z = x_j$, for clarity. We use the following straightforward result:

Lemma 1. Let $K'$ be a p.s.d. kernel defined in the region $B \times B$, let $\Phi$ be a map from a
region A into B, and let K be defined on $A \times A$ as $K(x,z) = K'(\Phi(x), \Phi(z))$. Then the
kernel K is p.s.d.

Proof. Clearly $\Phi(A)$ is a restriction of B, and $K'$ is p.s.d. in all $B \times B$.

Here, we take $K = s_{Can}$, $A = \mathbb{R}^+$, $\Phi(x) = \log(x)$, so that B is $\mathbb{R}$. We now find
what $K'$ would be by defining $x' = \log(x)$, $z' = \log(z)$, so that distance $d_{Can}$ can be
rewritten as

$$d_{Can}(x,z) = \frac{|x - z|}{x + z} = \frac{|e^{x'} - e^{z'}|}{e^{x'} + e^{z'}} .$$

As we noted above, $d_{Can}(x,z)$ is equivalent for any pair of points $x, z \in \mathbb{R}^+$ with
the same quotients $x/z$ or $z/x$. Assuming that $x > z$ without loss of generality, we
write this as a translation invariant kernel by introducing the increment in logarithmic
coordinates $h = |x' - z'| = x' - z' = \log(x/z)$:

$$d_{Can}(x,z) = \frac{e^{z'}e^{h} - e^{z'}}{e^{z'}e^{h} + e^{z'}} = \frac{e^{h} - 1}{e^{h} + 1} .$$

Substitution on $K = s_{Can}$ gives

$$s_{Can}(x,z) = 1 - \frac{e^{h} - 1}{e^{h} + 1} = \frac{2}{e^{h} + 1} .$$

Therefore, for $x', z' \in \mathbb{R}$, $x' = z' + h$, we have

$$K'(x',z') = K'(z' + h, z') = \frac{2}{e^{h} + 1} = F(h). \tag{10}$$

Note that F is a convex function of $h \in [0, \infty)$ with $F(0) = 1$, $F(\infty) = 0$.
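
The change of variables in the first step can be verified numerically. The sketch below, with function names chosen here for illustration, compares the Canberra similarity of (9) with $F(h) = 2/(e^h + 1)$ evaluated at $h = |\log x - \log z|$ for a handful of arbitrary positive pairs.

```python
import math


def canberra_similarity(x: float, z: float) -> float:
    """s_Can(x, z) = 1 - |x - z| / (x + z) for positive x, z, as in (9)."""
    return 1.0 - abs(x - z) / (x + z)


def F(h: float) -> float:
    """Translation-invariant form (10) in logarithmic coordinates."""
    return 2.0 / (math.exp(h) + 1.0)


pairs = [(1.0, 3.0), (7.0, 9.0), (0.2, 5.0), (4.0, 4.0)]

# Largest disagreement between the two expressions over the sample pairs.
max_gap = max(
    abs(canberra_similarity(x, z) - F(abs(math.log(x) - math.log(z))))
    for x, z in pairs
)
```

The two expressions should agree up to floating-point rounding, so `max_gap` stays at the level of machine precision.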
Second step. To prove our theorem we now only have to prove the p.s.d. property for
kernel $K'$ satisfying equation (10).

A direct proof uses an integral representation of convex functions that proceeds
as follows. Given a twice continuously differentiable function F of the real variable
$s \ge 0$, integrating by parts we find the formula

$$F(x) = -\int_x^{\infty} F'(s)\,ds = \int_x^{\infty} F''(s)(s - x)\,ds,$$

valid for all $x > 0$ on the condition that $F(s)$ and $sF'(s) \to 0$ as $s \to \infty$. The formula
can be written as

$$F(x) = \int_0^{\infty} F''(s)(s - x)_+\,ds,$$

which implies that whenever $F'' \ge 0$, we have expressed $F(x)$ as an integral combination
with positive coefficients of functions of the form $(s - x)_+$. This is a non-trivial,
but commonly used, result in convex theory.
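
The integral representation can be checked numerically for the function F of equation (10). The sketch below is an illustration only: the closed form of $F''$ was derived by hand for this example, and the truncation point `upper` and the number of trapezoid steps are arbitrary assumptions.

```python
import math


def F(x: float) -> float:
    """F(x) = 2 / (e^x + 1), as in (10)."""
    return 2.0 / (math.exp(x) + 1.0)


def F_second(s: float) -> float:
    """Second derivative of F; non-negative for s >= 0, so F is convex there."""
    e = math.exp(s)
    return 2.0 * e * (e - 1.0) / (e + 1.0) ** 3


def integral_representation(x: float, upper: float = 40.0, steps: int = 40000) -> float:
    """Trapezoidal estimate of the integral of F''(s) (s - x)_+ over [0, upper]."""
    h = upper / steps
    total = 0.0
    for k in range(steps + 1):
        s = k * h
        value = F_second(s) * max(0.0, s - x)
        # End points get half weight in the trapezoidal rule.
        total += value if 0 < k < steps else 0.5 * value
    return total * h


# Compare the integral with F itself at a few sample points.
errors = [abs(integral_representation(x) - F(x)) for x in (0.0, 0.5, 2.0)]
```

The entries of `errors` are small, limited only by the step size and the truncation of the infinite upper limit.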

Third step. The functions of the form $(s - x)_+$ are the building blocks of the Truncated
Euclidean Similarity kernels (7). Our kernel $K'$ is represented as an integral
combination of these functions with positive coefficients. In the previous Section we
have proved that functions of the form (8) are p.s.d. We know that the sum of p.s.d.
terms is also p.s.d., and the limit of p.s.d. kernels is also p.s.d. Since our expression
for $K'$ is, like all integrals, a limit of positive combinations of functions of the form
$(s - x)_+$, the previous argument proves that equation (10) is p.s.d., and by Lemma 1
our theorem is proved. More precisely, what we say is that, as a convex function, F
can be arbitrarily approximated by sums of functions of the type

$$f_n(x) = \max\Big\{0,\ a_n - \frac{x}{r_n}\Big\} \tag{11}$$

for $n \in \{0, \dots, N\}$, and the $r_n$ equally spaced in the range of the input (so that the bigger
the N the closer we get to (10)). Therefore, we can write

$$\frac{2}{e^{x} + 1} = \lim_{n \to \infty} \sum_{i=0}^{n} f_i(x), \tag{12}$$

where each term in the succession (12) is of the form (11), equivalent to (8).



6 Kernels defined on real vectors 

We establish now a result for positive vectors that leads to kernels analogous to the 
Gaussian RBF kernel. The reader can find useful additional material on positive and 
negative definite functions in Berg et al. 1984 (esp. Ch. 3). 

Definition 1 (Hadamard function). If $A = [a_{ij}]$ is an $n \times n$ matrix, the function
$f : A \to f(A) = [f(a_{ij})]$ is called a Hadamard function (actually, this is the simplest
type of Hadamard function).

Theorem 3. Let a p.s.d. matrix $A = [a_{ij}]$ and a Hadamard function f be given. If
f is an analytic function with positive radius of convergence $R > |a_{ij}|$ and all the
coefficients in its power series expansion are non-negative, then the matrix $f(A)$ is
p.s.d., as proved in Horn and Johnson (1991).

Definition 2 (p.s.d. function). A real symmetric function $f(x,y)$ of real variables
will be called p.s.d. if for any finite collection of n real numbers $x_1, \dots, x_n$, the $n \times n$
matrix A with entries $a_{ij} = f(x_i,x_j)$ is p.s.d.

Lemma 2. Let $b > 1 \in \mathbb{R}$, $c \in \mathbb{R}$ and let $c - f(x,y)$ be a p.s.d. function. Then
$b^{-f(x,y)}$ is a p.s.d. function.

Proof. The function $x \to b^{x}$ is analytic with infinite radius of convergence and all the
coefficients in its power series expansion are non-negative in case $b > 1$. By theorem
(3) the function $b^{c - f(x,y)}$ is p.s.d.; then so is $b^{c}\, b^{-f(x,y)}$ and consequently $b^{-f(x,y)}$ is
p.s.d. (since $b^{c}$ is a positive constant).

Theorem 4. The following function

$$k(x,y) = \exp\Big\{-\sum_{i=1}^{n} \frac{d(x_i,y_i)}{\sigma_i}\Big\}, \qquad x_i, y_i, \sigma_i \in \mathbb{R}^+$$

where d is any of (3), (4), (5), is a valid p.s.d. kernel.

Proof. For simplicity, make $d_i = d(x_i,y_i)$. We know $1 - d_i$ is a p.s.d. function, for the
choices of $d_i$ defined in (3), (4), (5). Therefore, $(1 - d_i)/\sigma_i$ for $\sigma_i > 0 \in \mathbb{R}$ is also p.s.d.
Making $c = 1/\sigma_i$, $f = d_i/\sigma_i$ and $b = e$, by lemma (2), the function $\exp(-d_i/\sigma_i)$ is
p.s.d. The product of p.s.d. functions is p.s.d., and thus

$$\prod_{i=1}^{n} \exp(-d_i/\sigma_i) = \exp\Big(-\sum_{i=1}^{n} \frac{d_i}{\sigma_i}\Big)$$

is p.s.d.
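
A small numerical illustration of Theorem 4 with the Canberra distance (4) follows. It is a sketch under assumptions: the data are random positive vectors, the scales in `sigmas` are arbitrary, and the test with random quadratic forms is only a spot check of the p.s.d. property, not a proof.

```python
import math
import random


def canberra(x: float, y: float) -> float:
    """One-component Canberra distance (4) for positive x, y."""
    return abs(x - y) / (x + y)


def exp_distance_kernel(x, y, sigmas, d=canberra):
    """k(x, y) = exp(-sum_i d(x_i, y_i) / sigma_i) for positive vectors."""
    return math.exp(-sum(d(xi, yi) / s for xi, yi, s in zip(x, y, sigmas)))


def quadratic_form(matrix, c):
    """c^T K c for a square matrix given as nested lists."""
    n = len(c)
    return sum(c[i] * c[j] * matrix[i][j] for i in range(n) for j in range(n))


rng = random.Random(1)
dim, n_points = 3, 7
sigmas = [0.5, 1.0, 2.0]  # arbitrary positive scales, one per component
data = [[rng.uniform(0.1, 10.0) for _ in range(dim)] for _ in range(n_points)]
K = [[exp_distance_kernel(p, q, sigmas) for q in data] for p in data]

# Smallest quadratic form over many random coefficient vectors.
min_form = min(
    quadratic_form(K, [rng.gauss(0.0, 1.0) for _ in range(n_points)])
    for _ in range(2000)
)
```

Passing a different one-component distance through the `d` argument, for instance the truncated Euclidean distance (3), gives the corresponding kernel of the same family.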

This result is useful since it establishes new kernels analogous to the Gaussian 
RBF kernel but based on alternative metrics. Computational considerations should 
not be overlooked: the use of the exponential function considerably increases the 
cost of evaluating the kernel. Hence, kernels not involving this function are specially 
welcome. 

Proposition 1. Let $d(x_i,x_j) = \frac{|x_i - x_j|}{x_i + x_j}$ be the Canberra distance. Then
$k(x_i,x_j) = 1 - d(x_i,x_j)/\sigma$ is a valid p.s.d. kernel if and only if $\sigma \ge 1$.

Proof. Let $d_{ij} = d(x_i,x_j)$. We know $\sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j (1 - d_{ij}) \ge 0$ for all $c_i, c_j \in \mathbb{R}$.
We have to show that $\sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j (1 - d_{ij}/\sigma) \ge 0$. This can be expressed as

This result is a generalization of Theorem (2), valid for $\sigma = 1$. It is immediate
that the following function (the Canberra kernel) is a valid kernel:

$$k(x,y) = 1 - \frac{1}{n}\sum_{i=1}^{n} \frac{d_i}{\sigma_i}, \qquad \sigma_i \ge 1$$

The inclusion of the $\sigma_i$ (acting as learning parameters) has the purpose of adding
flexibility to the models. Concerning the truncated Euclidean distance, a corresponding
kernel can be obtained in a similar way. Let $d(x_i,x_j) = \min\{1, |x_i - x_j|\}$ and denote
for a real number a, $a_+ = 1 - \min(1,a) = \max(0, 1 - a)$. Then $\sigma - \min\{\sigma, |x_i - x_j|\}$
is p.s.d. by Theorem (1) and so is $\max\{0, 1 - |x_i - x_j|/\sigma\}$. In consequence, it is
immediate to affirm that the following function (the Truncated Euclidean kernel) is
again a valid kernel:

$$k(x,y) = \frac{1}{n}\sum_{i=1}^{n} \max\Big\{0,\ 1 - \frac{|x_i - y_i|}{\sigma_i}\Big\}, \qquad \sigma_i > 0$$


7 Conclusions 

We have considered distance-based similarity measures for real-valued vectors of
interest in kernel-based methods, like the Support Vector Machine. The first is a
truncated Euclidean similarity and the second a self-normalized similarity. Derived
real kernels analogous to the RBF kernel have been proposed, so the kernel toolbox
is widened. These can be considered as suitable alternatives for a proper modeling of
data affected by multiplicative noise, skewed data and/or data containing outliers. In
addition, some rather general results concerning positivity properties have been presented
in detail.
Abstract. In many classification applications, Support Vector Machines (SVMs) have proven
to be high-performing and easy-to-handle classifiers with very good generalization abilities.
However, one drawback of the SVM is its rather high classification complexity, which scales
linearly with the number of Support Vectors (SVs). This is due to the fact that for the
classification of one sample, the kernel function has to be evaluated for all SVs. To speed up
classification, different approaches have been published, most of which try to reduce the number
of SVs. In our work, which is especially suitable for very large datasets, we follow a different
approach: as we showed in (Zapien et al. 2006), it is effectively possible to approximate large
SVM problems by decomposing the original problem into linear subproblems, where each
subproblem can be evaluated in $\Omega(1)$. This approach is especially successful when the
assumption holds that a large classification problem can be split into mainly easy and only a few
hard subproblems. On standard benchmark datasets, this approach achieved great speedups
while suffering only slightly in terms of classification accuracy and generalization ability. In
this contribution, we extend the methods introduced in (Zapien et al. 2006) using not only
linear, but also non-linear subproblems for the decomposition of the original problem, which
further increases the classification performance with only a little loss in terms of speed. An
implementation of our method is available in (Ronneberger et al.). Due to page limitations,
we had to move some of the theoretical details (e.g. proofs) and extensive experimental results to a
technical report (Zapien et al. 2007).



1 Introduction 

In terms of classification speed, SVMs (Vapnik 1995) are still outperformed by many
standard classifiers when it comes to the classification of large problems. For a non-linear
kernel function k, the classification function can be written as in Eq. (1). Thus,
the classification complexity lies in $\Omega(n)$ for a problem with n SVs. However, for
linear problems, the classification function has the form of Eq. (2), allowing classification
in $\Omega(1)$ by calculating the dot product with the normal vector w of the
hyperplane. In addition, the SVM has the problem that the complexity of a SVM
model always scales with the most difficult samples, forcing an increase in Support
Vectors. However, we observed that many large scale problems can easily be divided
into a large set of rather simple subproblems and only a few difficult ones. Following
this assumption, we propose a classification method based on a tree whose nodes
consist mostly of linear SVMs (Fig. 1).



$$f(x) = \operatorname{sign}\Big(\sum_{i=1}^{n} y_i \alpha_i k(x_i, x) + b\Big) \tag{1}$$

$$f(x) = \operatorname{sign}\big(\langle w, x\rangle + b\big) \tag{2}$$
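
The difference in cost between Eq. (1) and Eq. (2) can be made concrete with a small sketch. The support vectors, coefficients and bias below are made-up toy values, and the function names are illustrative assumptions. With a linear kernel the sum over support vectors collapses once into a single vector w, so each later classification needs only one dot product.

```python
def dot(u, v):
    """Standard dot product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def kernel_decision(x, svs, ys, alphas, b, kernel):
    """Eq. (1): one kernel evaluation per support vector."""
    score = sum(y * a * kernel(sv, x) for sv, y, a in zip(svs, ys, alphas)) + b
    return 1 if score >= 0 else -1  # a score of exactly 0 is mapped to +1 here


def collapse_linear(svs, ys, alphas):
    """For a linear kernel, w = sum_i y_i * alpha_i * x_i can be precomputed."""
    dim = len(svs[0])
    return [sum(y * a * sv[j] for sv, y, a in zip(svs, ys, alphas)) for j in range(dim)]


def linear_decision(x, w, b):
    """Eq. (2): a single dot product, independent of the number of SVs."""
    return 1 if dot(w, x) + b >= 0 else -1


# Toy values, chosen only for illustration.
svs = [(1.0, 1.0), (2.0, 0.0), (-1.0, -1.0), (0.0, -2.0)]
ys = [1, 1, -1, -1]
alphas = [0.5, 0.25, 0.5, 0.25]
b = 0.1

w = collapse_linear(svs, ys, alphas)
queries = [(1.0, 2.0), (-2.0, 0.5), (3.0, -1.0), (-0.5, -0.5)]
agree = all(
    kernel_decision(q, svs, ys, alphas, b, dot) == linear_decision(q, w, b)
    for q in queries
)
```

For a non-linear kernel such as the RBF no such collapse into a single vector is available in input space, which is why the number of SVs drives the classification cost.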



This paper is structured as follows: first we give a brief overview of related work. 
Section 2 describes our initial linear algorithm in detail including a discussion of the 
zero solution problem. In section 3, we introduce a non-linear extension to our initial 
algorithm, followed by Experiments in section 4. 




Fig. 1. Decision tree with linear SVM 



1.1 Related work 

Recent work on SVM classification speedup mainly focused on the reduction of the
decision problem: A method called RSVM (Reduced Support Vector Machines) was
proposed by Lee and Mangasarian (2001); it preselects a subset of training samples
as SVs and solves a smaller Quadratic Programming problem. Lei and Govindaraju
(2005) introduced a reduction of the feature space using principal component analysis
and Recursive Feature Elimination. Burges and Schoelkopf (1997) proposed a
method to approximate w by a list of vectors associated with coefficients $\alpha_i$. All these
methods yield good speedup, but are fairly complex and computationally expensive.
Our approach, on the other hand, was endorsed by the work of Bennett and Bredensteiner
(2000), who experimentally proved that inducing a large margin in decision
trees with linear decision functions improved the generalization ability.

2 Linear SVM trees

The algorithm is described for binary problems; an extension to multiple-class problems
can be realized with different techniques like one vs. one or one vs. rest (Hsu
and Lin 2001) (Zapien et al. 2007).

At each node i of the tree, a hyperplane is found that correctly classifies all samples
in one class (this class will be called the “hard” class, denoted $hc_i$). Then, all
correctly classified samples of the other class (the “soft” class) are removed from
the problem, Fig. (2). The decision of which class is to be assigned “hard” is taken
in a greedy manner for every node (Zapien et al. 2007). The algorithm terminates
when the remaining samples all belong to the same class. Fig. (3) shows a training
sequence. We will further extend this algorithm, but first we give a formalization for
the basic approach.

Fig. 2. Problem fourclass (Schoelkopf and Smola 2002). Left: hyperplane for the first node.
Right: Problem after first node (“hard” class = triangles).
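
The node-by-node procedure described above can be sketched in a few lines. This is a simplified illustration, not the authors' method: `fit_node` is a hypothetical stand-in that uses the difference of class means as the hyperplane direction and shifts the bias until every hard-class sample lies on the non-negative side, whereas the paper trains an SVM at each node. The greedy choice here simply prefers the hard class that removes more soft samples, which is also an assumption.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_node(X, y, hard):
    """Hypothetical stand-in for the node solver: a hyperplane that keeps
    every sample of the hard class on the side w.x + b >= 0."""
    hard_pts = [x for x, label in zip(X, y) if label == hard]
    soft_pts = [x for x, label in zip(X, y) if label != hard]
    w = [h - s for h, s in zip(mean(hard_pts), mean(soft_pts))]
    b = -min(dot(w, x) for x in hard_pts)
    return w, b


def train_tree(X, y):
    """Build a list of (w, b, hard_class) nodes plus a final leaf label."""
    X, y = list(X), list(y)
    nodes = []
    while len(set(y)) == 2:
        best = None
        for hard in (1, -1):  # greedy choice of the hard class
            w, b = fit_node(X, y, hard)
            removed = [i for i, (x, label) in enumerate(zip(X, y))
                       if label != hard and dot(w, x) + b < 0]
            if best is None or len(removed) > len(best[3]):
                best = (w, b, hard, removed)
        w, b, hard, removed = best
        if not removed:  # the stand-in made no progress; stop early
            break
        nodes.append((w, b, hard))
        drop = set(removed)
        X = [x for i, x in enumerate(X) if i not in drop]
        y = [label for i, label in enumerate(y) if i not in drop]
    positives = sum(1 for label in y if label == 1)
    leaf = 1 if positives >= len(y) - positives else -1
    return nodes, leaf


def classify(x, nodes, leaf):
    """Run a sample down the tree; the soft side of a node decides at once."""
    for w, b, hard in nodes:
        if dot(w, x) + b < 0:  # no hard-class training sample lies on this side
            return -hard
    return leaf


# Toy data, linearly separable along the first coordinate.
X_toy = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
y_toy = [-1, -1, 1, 1]
nodes, leaf = train_tree(X_toy, y_toy)
predictions = [classify(x, nodes, leaf) for x in X_toy]
```

On this toy problem a single node suffices; harder problems produce a longer list of nodes, and most samples are expected to leave the tree at an early node.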

Problem Statement. Given a two class problem with $m = m_1 + m_{-1}$ samples $x_i \in \mathbb{R}^n$
with labels $y_i$, $i \in CC$ and $CC = \{1, \dots, m\}$. Without loss of generality we define a
Class 1 (Positive Class) $CC_1 = \{1, \dots, m_1\}$, $y_i = 1$ for all $i \in CC_1$, with a global
penalization value $D_1$ and individual penalization values $C_i = D_1$ for all $i \in CC_1$, as well
as an analog Class -1 (Negative Class) $CC_{-1} = \{m_1 + 1, \dots, m_1 + m_{-1}\}$, $y_i = -1$ for
all $i \in CC_{-1}$, with a global penalization value $D_{-1}$ and individual penalization values
$C_i = D_{-1}$ for all $i \in CC_{-1}$.

2.1 Zero vector as solution 

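In order to train a SVM using the previous definitions, taking one class to be “hard”
in a training step, e.g. $CC_{-1}$ is the “hard” class, one could simply set $D_{-1} \to \infty$ and
$D_1 \ll D_{-1}$ in the primal SVM optimization problem:

$$\min_{w \in \mathbb{R}^n,\ b \in \mathbb{R},\ \xi \in \mathbb{R}^m} \quad \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} C_i \xi_i \tag{3}$$

$$\text{subject to} \quad y_i(\langle x_i, w\rangle + b) \ge 1 - \xi_i, \quad i = 1, \dots, m, \tag{4}$$

$$\xi_i \ge 0, \quad i = 1, \dots, m. \tag{5}$$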

Unfortunately, in some cases the optimization process converges to a trivial solution:
the zero vector. We used the convex hull interpretation of SVMs (Bennett and
Bredensteiner 2000) in order to determine under which circumstances the trivial
solution occurs, and proved the following theorems (Zapien et al. 2007):

Fig. 3. Sequence (left to right) of hyperplanes for nodes 1-6 of the tree.

Theorem 1: If the convex hull of the “hard” class $CC_1$ intersects the convex hull of
the “soft” class $CC_{-1}$, then $w = 0$ is a feasible point for the primal Problem (4) if
$D_{-1} \ge \max_{i \in CC_1}\{\lambda_i\} \cdot D_1$, where $\lambda_i$ are such that

$$p = \sum_{i \in CC_1} \lambda_i x_i$$

is a convex combination for a point p that belongs to both convex hulls.

Theorem 2: If the center of gravity $s_{-1}$ of class $CC_{-1}$ is inside the convex hull of
class $CC_1$, then it can be written as

$$s_{-1} = \sum_{i \in CC_1} \lambda_i x_i$$

with $\lambda_i \ge 0$ for all $i \in CC_1$ and $\sum_{i \in CC_1} \lambda_i = 1$. If additionally $D_1 \ge \lambda_{max} D_{-1} m_{-1}$,
where $\lambda_{max} = \max_{i \in CC_1}\{\lambda_i\}$, then $w = 0$ is a feasible point for the primal Problem.

Please refer to (Zapien et al. 2007) for detailed proofs of both theorems.

2.2 H1-SVM problem formulation

To avoid the zero vector, we proposed a modification of the original SVM optimization
problem which takes advantage of the previous theorems: the H1-SVM (H1
for one hard class).

H1-SVM Primal Problem

$$\min_{w \in \mathbb{R}^n,\ b \in \mathbb{R}} \quad \frac{1}{2}\|w\|^2 - \sum_{i \in CC_{\bar{k}}} y_i(\langle x_i, w\rangle + b) \tag{6}$$

$$\text{subject to} \quad y_i(\langle x_i, w\rangle + b) \ge 1 \quad \text{for all } i \in CC_k, \tag{7}$$

where $k = 1$ and $\bar{k} = -1$, or $k = -1$ and $\bar{k} = 1$.

In this new formulation, the constraint Eq. (7) forces all samples in the class $CC_k$ to be
classified perfectly, forcing a “hard” convex hull (H1) for $CC_k$. The number of misclassifications
on the other class $CC_{\bar{k}}$ is added to the objective function, hence the solution is a
trade-off between a maximal margin and a minimum number of misclassifications in
the “soft” class $CC_{\bar{k}}$.

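H1-SVM Dual Formulation

    max          $\sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle$    (8)

    subject to   $0 \le \alpha_i \le C_k$,   $i \in CC_k$,    (9)

                 $\alpha_j = 1$,   $j \in CC_{\bar{k}}$,    (10)

                 $\sum_{i=1}^{m} \alpha_i y_i = 0$,    (11)

where $k = 1$ and $\bar{k} = -1$, or $k = -1$ and $\bar{k} = 1$.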

This problem can be solved in a similar way to the original SVM problem using the
SMO algorithm (Schoelkopf and Smola 2002; Zapien et al. 2007), with some
modifications to force $\alpha_i = 1$ for all $i \in CC_{\bar{k}}$.

Theorem 3: For the H1-SVM the zero solution can only occur if $|CC_k| \ge (n - 1)$ and
there exists a linear combination of the sample vectors in the "hard" class,
$x_i \in CC_k$, that equals the sum of the sample vectors in the "soft" class,
$\sum_{i \in CC_{\bar{k}}} x_i$.

Proof: Without loss of generality, let the "hard" class be class $CC_1$. Then,

    $w = \sum_{i=1}^{m} \alpha_i y_i x_i = \sum_{i \in CC_1} \alpha_i x_i - \sum_{i \in CC_{-1}} \alpha_i x_i = \sum_{i \in CC_1} \alpha_i x_i - \sum_{i \in CC_{-1}} x_i$.    (12)

If we define $z_{-1} = \sum_{i \in CC_{-1}} x_i$ and $|CC_1| \ge (n - 1) = \dim(z_{-1}) - 1$,
there exist $\{\alpha_i\}$, $i \in CC_1$, $\alpha_i \ne 0$, such that

    $w = \sum_{i \in CC_1} \alpha_i x_i - z_{-1} = 0$.

The usual threshold calculation (Keerthi et al. 1999; Schoelkopf and Smola 2002)
can no longer be used to define the hyperplane. Please refer to (Zapien et al.
2007) for details on the threshold computation.

The basic algorithm can be improved with some heuristics for greedy "hard"-class
determination and tree pruning, shown in (Zapien et al. 2007).



3 Non-linear extension

In order to classify a sample, one simply runs it down the SVM-tree. When using
only linear nodes, we already obtained good results (Zapien et al. 2006), but we
also observed that, first, most errors occur in the last node, and second, that
overall only a few samples reach the last node during the classification procedure.
This motivated us to add a non-linear node (e.g. using RBF kernels) to the end of
the tree.

Fig. 4. SVM tree with non-linear extension

Training of this extended SVM-tree is analogous to the original case. First a
purely linear tree is built. Then we use a heuristic (trade-off between average
classification depth and accuracy) to move the final, non-linear node from the last
node up the tree. It is very important to note that, to avoid overfitting, the final
non-linear SVM has to be trained on the entire initial training set, and not only on
the samples remaining after the last linear node. Otherwise the final node is very
likely to suffer from strong overfitting. Of course, the final model will then have
many SVs, but since only a few samples reach the final node, our experiments
indicate that the average classification depth is hardly affected.
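
The classification pass through such a tree can be sketched as follows. This is only an illustrative sketch under assumptions that are not taken from the paper: each linear node is stored as a hypothetical tuple (w, b, soft_label), a negative score means the sample lies on the side that contains only "soft"-class training samples (so the walk can stop there), and `final_node` stands in for the non-linear SVM at the end of the tree.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical node layout: (weight vector w, threshold b, label returned
# when the sample falls on the purely "soft" side of the hyperplane).
Node = Tuple[Sequence[float], float, int]

def classify(x: Sequence[float],
             nodes: List[Node],
             final_node: Callable[[Sequence[float]], int]) -> Tuple[int, int]:
    """Run x down the chain of linear nodes; return (label, depth reached)."""
    for depth, (w, b, soft_label) in enumerate(nodes, start=1):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if score < 0:
            # Assumed convention: no "hard"-class training sample lies on
            # this side, so the sample is labelled and the walk stops early.
            return soft_label, depth
    # Only samples that pass every linear node reach the final node.
    return final_node(x), len(nodes) + 1
```

The returned depth is what an "average classification depth" would be averaged over; samples that exit early never touch the expensive final node.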



4 Experiments 

In order to show the validity and classification accuracy of our algorithm we
performed a series of experiments on standard benchmark data sets. These experiments
were conducted*, among others, on Faces (Carbonetto) (9172 training samples, 4262
test samples, 576 features) and USPS (Hull 1994) (18063 training samples, 7291 test
samples, 256 features). More detailed experiments can be found in (Zapien et al.
2007). The data was split into training and test sets and normalized to minimum and
maximum feature values (Min-Max) or standard deviation (Std-Dev).
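
The two normalizations can be sketched per feature column as below. The exact variant used in the experiments is not specified here, so this sketch assumes that Min-Max maps each feature to [0, 1] and that Std-Dev means centering and dividing by the population standard deviation; the function names are illustrative.

```python
from statistics import mean, pstdev

def min_max_scale(column):
    """Map a feature column linearly onto [0, 1] (assumed Min-Max variant)."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant feature: nothing to scale
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def std_dev_scale(column):
    """Center a feature column and divide by its standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:                    # constant feature
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]
```

In practice the scaling parameters would usually be estimated on the training set and then reused for the test set.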

* These experiments were run on a computer with a P4, 2.8 GHz and 1 GB of RAM.

Faces (Min-Max)          RBF Kernel   H1-SVM     H1-SVM Gr-Heu   RBF/H1   RBF/H1 Gr-Heu
Nr. SVs or Hyperplanes   2206         4          4               551.5    551.5
Training Time            14:55.23     10:55.70   14:21.99        1.37     1.04
Classification Time      03:13.60     00:14.73   00:14.63        13.14    13.23
Classif. Accuracy %      95.78 %      91.01 %    91.01 %         1.05     1.05

USPS (Min-Max)           RBF Kernel   H1-SVM     H1-SVM Gr-Heu   RBF/H1   RBF/H1 Gr-Heu
Nr. SVs or Hyperplanes   3597         49         49              73.41    73.41
Training Time            00:44.74     00:22.70   02:09.58        1.97     0.35
Classification Time      01:58.59     00:19.99   00:20.07        5.93     5.91
Classif. Accuracy %      95.82 %      93.76 %    93.76 %         1.02     1.02



Comparisons to related work are difficult, since most publications (Bennett and
Bredensteiner 2000; Lee and Mangasarian 2001) used datasets with fewer than 1000
samples, for which training and testing times are negligible. In order to test the
performance and speedup on very large datasets, we used our own Cell Nuclei Database
(Zapien et al. 2007) with 3372 training samples, 32 features each, and about 16
million test samples:





                               RBF-Kernel   linear tree H1-SVM   non-linear tree H1-SVM
training time                               s:i3s                ^5s
Nr. SVs or Hyperplanes         980          86                   86
average classification depth   -            7.3                  8.6
classification time            Ril.5h       ~2 min               ~2 min
accuracy                       97.69%       95.43%               97.5%



5 Conclusion 

We have presented a new method for fast SVM classification. Compared to non-linear
SVMs and speedup methods, our experiments showed a very competitive speedup while
achieving reasonable classification results (losing only marginally compared to
non-linear methods when we apply the non-linear extension). Especially if our
initial assumption holds, namely that large problems can be split into mainly easy
and only a few hard problems, our algorithm achieves very good results. The
advantage of this approach clearly lies in its simplicity, since no parameter has to
be tuned.



Abstract. In the last decade classifier ensembles have enjoyed growing attention and
popularity due to their properties and successful applications.

A number of combination techniques, including majority vote, average vote,
behavior-knowledge space, etc., are used to amplify correct decisions of the
ensemble members. But the key to the success of classifier fusion is the diversity
of the combined classifiers.

In this paper we compare the most commonly used combination rules and discuss their 
relationship with diversity of individual classifiers. 



1 Introduction 

Fusion of multiple classifiers is one of the recent major advances in statistics and
machine learning. In this framework, multiple models are built on the basis of a
training set and combined into an ensemble or a committee of classifiers. Then the
component models jointly determine the predicted class.

Classifier ensembles have proved to be high-performance classification systems in
numerous applications, e.g. pattern recognition, document analysis, personal
identification, data mining etc.

High accuracy of the ensemble is achieved if its members are "weak" and diverse.
The term "weak" refers to unstable classifiers, such as classification trees and
neural nets. Diversity means that the classifiers are different from each other
(independent, uncorrelated). This is usually obtained by using different training
subsets, assigning different weights to instances or selecting different subsets of
features.

Turner and Ghosh (1996) have shown that the ensemble error decreases with the
reduction in correlation between component classifiers. Therefore, we need to assess
the level of independence of the members of the ensemble, and different measures of
diversity have been proposed so far.

The paper is organised as follows. In Section 2 we give some basics on classifier
fusion. Section 3 contains a short description of selected diversity measures. In
Section 4 we discuss the fusion methods (combination rules). The problems related
to the assessment of the performance of combination rules and their relationship
with diversity measures are presented in Section 5. Section 6 gives a brief
description of our experiments and the obtained results. The last section contains
some conclusions.



2 Classifier fusion 

A classifier C is any mapping $C: X \to Y$ from the feature space X into a set of
class labels $Y = \{l_1, l_2, \ldots, l_J\}$.

Classifier fusion consists of two steps. In the first step the set of M individual
classifiers $\{C_1, C_2, \ldots, C_M\}$ is designed on the basis of the training set
$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$.

Then, in the second step, their predictions are combined into an ensemble $C^*$
using a combination function F:

    $C^* = F(C_1, C_2, \ldots, C_M)$.    (1)

Various combination rules have been proposed in the literature to approximate the
function F, and some of them will be discussed in Section 4.



3 Diversity of ensemble members 

In order to assess the mutual independence of individual classifiers, different
measures have been proposed. The simplest ones are pairwise measures defined between
two classifiers, and the overall diversity of the ensemble is the average of the
diversities (p) between all pairs of the ensemble members:

    $Diversity(C^*) = \frac{2}{M(M-1)} \sum_{m=1}^{M-1} \sum_{k=m+1}^{M} p(m,k)$.    (2)

The relationship between a pair of classifiers $C_i$ and $C_j$ can be shown in the
form of a 2 × 2 contingency table (Table 1).



Table 1. A 2 × 2 contingency table for the two classifier outputs.

Classifiers      C_j is correct   C_j is wrong
C_i is correct   a                b
C_i is wrong     c                d



The well-known measure of classifier dependence is the binary version of
Pearson's correlation coefficient:

    $r(i,j) = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$.    (3)

Partridge and Yates (1996) have used a measure named within-set generalization
diversity. This measure is simply the kappa statistic:

    $\kappa(i,j) = \frac{2(ad - bc)}{(a+b)(b+d) + (a+c)(c+d)}$.    (4)

Skalak (1996) reported the use of the disagreement measure:

    $DM(i,j) = \frac{b + c}{a + b + c + d}$.    (5)



Giacinto and Roli (2000) have introduced a measure based on the compound error
probability for the two classifiers, named compound diversity:

    $CD(i,j) = \frac{d}{a + b + c + d}$.    (6)

This measure is also named the "double-fault measure" because it is the proportion
of the examples that have been misclassified by both classifiers.

Kuncheva et al. (2000) strongly recommended Yule's Q statistic to evaluate the
diversity:

    $Q(i,j) = \frac{ad - bc}{ad + bc}$.    (7)

Unfortunately, this measure has two disadvantages. In some cases its value may be
undefined, e.g. when a = 0 and b = 0, and it cannot distinguish between different
distributions of classifier outputs.

In order to overcome the drawbacks of Yule's Q statistic, Gatnar (2005) proposed a
diversity measure based on Hamann's coefficient:

    $H(i,j) = \frac{(a + d) - (b + c)}{a + b + c + d}$.    (8)
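
As an illustration, the pairwise measures (3)-(8) can be computed from the four cell counts of Table 1. The sketch below is not the authors' code; the function names are made up for this example, kappa is computed in its usual form 2(ad - bc) / ((a+b)(b+d) + (a+c)(c+d)), and undefined ratios (zero denominator) are returned as None.

```python
from math import sqrt

def contingency(correct_i, correct_j):
    """Count the cells a, b, c, d of Table 1 from two 0/1 correctness vectors."""
    a = sum(1 for u, v in zip(correct_i, correct_j) if u and v)
    b = sum(1 for u, v in zip(correct_i, correct_j) if u and not v)
    c = sum(1 for u, v in zip(correct_i, correct_j) if not u and v)
    d = sum(1 for u, v in zip(correct_i, correct_j) if not u and not v)
    return a, b, c, d

def _ratio(num, den):
    return None if den == 0 else num / den   # None marks an undefined value

def pairwise_measures(a, b, c, d):
    n = a + b + c + d
    return {
        "r": _ratio(a * d - b * c, sqrt((a + b) * (c + d) * (a + c) * (b + d))),
        "kappa": _ratio(2 * (a * d - b * c), (a + b) * (b + d) + (a + c) * (c + d)),
        "DM": _ratio(b + c, n),             # disagreement measure
        "CD": _ratio(d, n),                 # compound diversity (double fault)
        "Q": _ratio(a * d - b * c, a * d + b * c),
        "H": _ratio((a + d) - (b + c), n),  # Hamann's coefficient
    }
```

For example, with a = 0 and b = 0 the Q entry comes back as None, which is the undefined case mentioned above, while H remains well defined.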



Several non-pairwise measures have also been developed to evaluate the level of
diversity between all members of the ensemble.

Cunningham and Carney (2000) suggested using the entropy function:

    $EC = -\frac{1}{N}\sum_{i=1}^{N} L(x_i)\log(L(x_i)) - \frac{1}{N}\sum_{i=1}^{N} (M - L(x_i))\log(M - L(x_i))$,    (9)

where L(x) is the number of classifiers that correctly classified the observation x.
Its simplified version was introduced by Kuncheva and Whitaker (2003):

    $E = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{M - \lceil M/2 \rceil} \min\{L(x_i), M - L(x_i)\}$.    (10)



Kohavi and Wolpert (1996) used their variance to evaluate the diversity:

    $KW = \frac{1}{N M^2}\sum_{i=1}^{N} L(x_i)(M - L(x_i))$.    (11)



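Dietterich (2000) also proposed a measure to assess the level of agreement between
classifiers. It is the kappa statistic:

    $\kappa = 1 - \frac{\frac{1}{M}\sum_{i=1}^{N} L(x_i)(M - L(x_i))}{N(M-1)\bar{p}(1-\bar{p})}$,    (12)

where $\bar{p}$ is the average accuracy of the individual classifiers.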



Hansen and Salamon (1990) introduced the measure of difficulty $\theta$. It is simply
the variance of the random variable Z = L(x)/M:

    $\theta = Var(Z)$.    (13)
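
The measures based on L(x), the number of ensemble members that classify an observation correctly, are simple to compute. The following sketch covers the simplified entropy E of (10), the variance KW (taken here in the form of Kuncheva and Whitaker, with the factor 1/(N M^2)) and the difficulty measure of (13). Using the population variance for Var(Z) is an assumption of this example, as are the function names.

```python
from math import ceil
from statistics import pvariance

def entropy_e(L, M):
    """Simplified entropy measure E for correct-vote counts L and ensemble size M."""
    return sum(min(l, M - l) for l in L) / (len(L) * (M - ceil(M / 2)))

def kw_variance(L, M):
    """Kohavi-Wolpert variance: average of L(x)(M - L(x)) scaled by 1/M^2."""
    return sum(l * (M - l) for l in L) / (len(L) * M ** 2)

def difficulty_theta(L, M):
    """Variance of Z = L(x)/M (population variance assumed here)."""
    return pvariance([l / M for l in L])
```

For M = 3 classifiers and counts L = [3, 2, 1, 0] the observations with a split vote (2 or 1 correct) are the only ones contributing to E and KW.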



Two measures of diversity have been proposed by Partridge and Krzanowski (1997) for
the evaluation of software diversity. The first one is the generalized diversity
measure:

    $GD = 1 - \frac{p(2)}{p(1)}$,    (14)

where p(k) is the probability that k randomly chosen classifiers will fail on the
observation x. The second measure is named coincident failure diversity:

    $CFD = 0$ where $p_0 = 1$, and $CFD = \frac{1}{1 - p_0}\sum_{m=1}^{M} \frac{M - m}{M - 1}\, p_m$ where $p_0 < 1$,    (15)

where $p_m$ is the probability that exactly m out of M classifiers will fail on an
observation x.



4 Combination rules 



Once we have produced the set of individual classifiers of desired level of diversity, 
we combine their predictions to amplify their correct decisions and cancel out the 
wrong ones. The combination function F in (1) depends on the type of the classifier 
outputs. 

There are three different forms of classifier output. The classifier can produce a 
single class label (abstract level), rank the class labels according to their posterior 
probabilities (rank level), or produce a vector of posterior probabilities for classes 
(measurement level). 

Majority voting is the most popular combination rule for class labels¹:

    $C^*(x) = \arg\max_j \sum_{m=1}^{M} I(C_m(x) = l_j)$.    (16)

¹ In the R statistical environment we obtain class labels using the command
predict(..., type="class").

It can be proved that it is optimal if the number of classifiers is odd, they have
the same accuracy, and the classifiers' outputs are independent. If we have evidence
that certain models are more accurate than others, weighting the individual
predictions may improve the overall performance of the ensemble.
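
A minimal sketch of plain and weighted majority voting on the abstract level follows. The weights are hypothetical (for instance, estimated member accuracies), and ties are broken in favour of the label that appears first, which is a choice made only for this example.

```python
def majority_vote(labels, weights=None):
    """Return the label with the largest (optionally weighted) number of votes."""
    if weights is None:
        weights = [1.0] * len(labels)      # plain majority voting
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    # max() keeps the first label encountered when totals are tied
    return max(totals, key=totals.get)
```

With equal weights this reduces to rule (16); with unequal weights a single accurate member can outvote two weaker ones.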

Behavior Knowledge Space developed by Huang and Suen (1995) uses a look-up 
table that keeps track of how often each class combination is produced by the clas- 
sifiers during training. Then, during testing, the winner class is the most frequently 
observed class in the BKS table for the combination of class labels produced by the 
set of classifiers. 
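
The look-up table idea can be sketched as follows. The table layout and the `fallback` argument (what to do for a label combination never seen during training) are assumptions of this example and are not taken from Huang and Suen (1995).

```python
from collections import Counter, defaultdict

def train_bks(member_outputs, true_labels):
    """Count, for every combination of member labels, how often each true class occurs."""
    table = defaultdict(Counter)
    for outputs, y in zip(member_outputs, true_labels):
        table[tuple(outputs)][y] += 1
    return table

def predict_bks(table, outputs, fallback):
    """Return the most frequent true class recorded for this label combination."""
    cell = table.get(tuple(outputs))
    if not cell:
        return fallback(outputs)   # combination not observed during training
    return cell.most_common(1)[0][0]
```

The number of possible cells grows exponentially with the number of classifiers, so many cells stay empty for large ensembles and the fallback is used often.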

Wernecke (1992) proposed a method similar to BKS, that uses the look-up table 
with 95% confidence intervals of the class frequencies. If the intervals overlap, the 
least wrong classifier gives the class label. 

Naive Bayes combination introduced by Domingos and Pazzani (1997) also needs
training to estimate the prior and posterior probabilities:

    $s_j(x) = P(l_j)\prod_{m=1}^{M} P(C_m(x) \mid l_j)$.    (17)

Finally, the class with the highest value of $s_j(x)$ is chosen as the ensemble
prediction.

On the measurement level, each classifier produces a vector of posterior
probabilities² $C_m(x) = [c_{m1}(x), c_{m2}(x), \ldots, c_{mJ}(x)]$. Combining the
predictions of all models, we have a matrix called the decision profile for an
instance x:

    $DP(x) = \begin{bmatrix} c_{11}(x) & c_{12}(x) & \ldots & c_{1J}(x) \\ \ldots & \ldots & \ldots & \ldots \\ c_{M1}(x) & c_{M2}(x) & \ldots & c_{MJ}(x) \end{bmatrix}$.    (18)

Based on the decision profile we calculate the support for each class ($s_j(x)$), and
the final prediction of the ensemble is the class with the highest support:

    $C^*(x) = \arg\max_j \{s_j(x)\}$.    (19)

The most commonly used is the average (mean) rule:

    $s_j(x) = \frac{1}{M}\sum_{m=1}^{M} c_{mj}(x)$.    (20)

There are also other algebraic rules that calculate the median, maximum, minimum and
product of posterior probabilities for the j-th class. For example, the product rule
is:

    $s_j(x) = \prod_{m=1}^{M} c_{mj}(x)$.    (21)
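
The algebraic rules differ only in the function applied to each column of the decision profile. A small sketch follows, assuming the profile is given as a list of M rows (one per classifier) with J posterior probabilities each; the rule names mirror those used in the experiments but are otherwise arbitrary.

```python
from math import prod
from statistics import median

RULES = {
    "mean": lambda column: sum(column) / len(column),
    "prod": prod,
    "max": max,
    "min": min,
    "med": median,
}

def combine(profile, rule):
    """Return (index of the winning class, support per class) for one decision profile."""
    aggregate = RULES[rule]
    n_classes = len(profile[0])
    support = [aggregate([row[j] for row in profile]) for j in range(n_classes)]
    winner = max(range(n_classes), key=support.__getitem__)
    return winner, support
```

For the profile [[0.6, 0.4], [0.7, 0.3], [0.1, 0.9]] the mean, product, maximum and minimum rules pick the second class, driven by the one confident classifier, whereas the median rule picks the first.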

² We use the command predict(..., type="prob").

Kuncheva et al. (2001) proposed a combination method based on Decision Templates,
which are averaged decision profiles for each class ($DT_j$). Given an instance x,
its decision profile is compared to the decision templates of each class, and the
class whose decision template is closest (in terms of the Euclidean distance) is
chosen as the ensemble prediction:

    $s_j(x) = 1 - \frac{1}{MJ}\sum_{m=1}^{M}\sum_{k=1}^{J} \left(DT_j(m,k) - c_{mk}(x)\right)^2$.    (22)
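
The Decision Templates rule can be sketched in two steps: averaging the training decision profiles per class, then scoring a new profile by one minus its mean squared distance to each template. The data layout (profiles as M x J nested lists) and the function names are assumptions of this example.

```python
def decision_templates(profiles, labels):
    """Average the decision profiles of the training instances of each class."""
    sums, counts = {}, {}
    for dp, y in zip(profiles, labels):
        if y not in sums:
            sums[y] = [[0.0] * len(dp[0]) for _ in dp]
            counts[y] = 0
        for m, row in enumerate(dp):
            for k, value in enumerate(row):
                sums[y][m][k] += value
        counts[y] += 1
    return {y: [[v / counts[y] for v in row] for row in s] for y, s in sums.items()}

def dt_support(dp, template):
    """One minus the mean squared difference between a profile and a template."""
    M, J = len(dp), len(dp[0])
    squared = sum((template[m][k] - dp[m][k]) ** 2 for m in range(M) for k in range(J))
    return 1.0 - squared / (M * J)

def dt_predict(dp, templates):
    return max(templates, key=lambda y: dt_support(dp, templates[y]))
```

Because the templates are estimated from the training set, this is one of the trainable rules.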

There are other combination functions using more sophisticated methods, such 
as fuzzy integrals (Grabisch, 1995), Dempster-Shafer theory of evidence (Rogova, 
1994) etc. 

The rules presented above can be divided into two groups: trainable and non- 
trainable. In trainable rules we determine the values of their parameters using the 
training set, e.g. cell frequencies in the BKS method, or Decision Templates for 
classes. 



5 Open problems 

There are several problems that remain open in classifier fusion. In this paper we
focus on only two of them. We have presented ten combination rules above, so the
first problem is the search for the best one, i.e. the one that gives the most
accurate ensembles.

The second problem concerns the relationship between diversity measures and
combination functions. If there is any, we would be able to predict the ensemble
accuracy knowing the level of diversity of its members.



6 Results of experiments 

In order to find the best combination rule and determine the relationship between
combination rules and diversity measures we have used 10 benchmark datasets, divided
into learning and test parts, as shown in Table 2.

For each dataset we have generated 100 ensembles of different sizes: M = 10, 20, 30,
40, 50, and we used classification trees³ as the base models.

We have computed the average ranks for the combination functions, where rank 1 was
assigned to the best rule, i.e. the one that produced the most accurate ensemble,
and rank 10 to the worst one. The ranks are presented in Table 3.

We found that the mean rule is simple and has consistent performance for the
measurement level, and majority voting is a good combination rule for class labels.
The maximum rule is too optimistic, while the minimum rule is too pessimistic.

If the classifier correctly estimates the posterior probabilities, the product rule
should be considered. But it is sensitive to the most pessimistic classifier.

³ In order to grow trees, we have used the Rpart procedure written by Therneau and
Atkinson (1997) for the R environment.

Table 2. Benchmark datasets.

Dataset     Number of cases    Number of cases   Number of    Number
            in training set    in test set       predictors   of classes
DNA         2124               1062              180          3
Letter      16000              4000              16           26
Satellite   4290               2145              36           6
Iris        100                50                4            3
Spam        3000               1601              57           2
Diabetes    512                256               8            2
Sonar       138                70                60           2
Vehicle     564                282               18           4
Soybean     455                228               34           19
Zip         7291               2007              256          10



Table 3. Average ranks for combination methods.

Method   Rank
mean     2.98
vote     3.50
prod     4.73
med      4.91
min      6.37
bayes    6.42
max      7.28
DT       7.45
Wer      7.94
BKS      8.21



Figure 1 illustrates the comparison of performance of the combination functions for
the Spam dataset, which is typical of the datasets used in our experiments. We can
observe that the fixed rules perform better than the trained rules.

Fig. 1. Boxplots of combination rules for the Spam dataset.

We have also noticed that the mean, median and vote rules give similar results.
Moreover, cluster analysis has shown that there are three more groups of rules of
similar performance: minimum and maximum, Bayes and Decision Templates, BKS and
Wernecke's combination method.

In order to find the relationship between the combination functions and the di- 
versity measures, we have calculated Pearson correlations. Correlations are moderate 
(greater than 0.4) between mean, mode, product, and vote rules and Compound Di- 
versity (6) as the only pairwise measure of diversity. 

For non-pairwise measures correlations are strong (greater than 0.6) only be- 
tween average, median, and vote rules, and Theta (13). 



7 Conclusions 

In this paper we have compared ten functions that combine outputs of the individual 
classifiers into the ensemble. We have also studied the relationships between the 
combination rules and diversity measures. 

In general, we have observed that trained rules, such as BKS, Wernecke, Naive Bayes
and Decision Templates, perform poorly, especially for a large number of component
classifiers (M). This result is contrary to Duin (2002), who argued that trained
rules are better than fixed rules.

We have also found that the mean rule and the voting rule are good for the
measurement level and abstract level, respectively.

But there are no strong correlations between the combination functions and the
diversity measures. This means that we cannot predict the ensemble accuracy for a
particular combination method.



Abstract. Margin-based classifiers like the SVM and ANN have two drawbacks. They are
only directly applicable to two-class problems and they only output scores which do
not reflect the assessment uncertainty. K-class assessment probabilities are usually
generated by using a reduction to binary tasks, univariate calibration and further
application of the pairwise coupling algorithm. This paper presents an alternative
to coupling that uses the Dirichlet distribution.



1 Introduction 

Although many classification problems cover more than two classes, the margin- 
based classifiers such as the Support Vector Machine (SVM) and Artificial Neural 
Networks (ANN), are only directly applicable to binary classification tasks. Thus, 
tasks with number of classes K greater than 2 require a reduction to several binary 
problems and a following combination of the produced binary assessment values to 
just one assessment value per class. 

Before this combination it is beneficial to generate comparable outcomes by
calibrating them to probabilities which reflect the assessment uncertainty in the
binary decisions, see Section 2. Analyses of the calibration of dichotomous
classifier scores show that the calibrators using Mapping with Logistic Regression
or the Assignment Value idea perform best and most robustly, see Gebel and Weihs
(2007). To date, pairwise coupling by Hastie and Tibshirani (1998) is the standard
approach for the subsequent combination of binary assessment values, see Section 3.
Section 4 presents a new multi-class calibration method for margin-based classifiers 
which combines the binary outcomes to assessment probabilities for the K classes. 
This method based on the Dirichlet distribution will be compared in Section 5 to the
coupling algorithm.



2 Reduction to binary problems

Regard a classification task based on a training set
$T := \{(x_i, c_i),\ i = 1, \ldots, N\}$ with $x_i$ being the i-th observation of a
random vector X of p feature variables and the respective class
$c_i \in C = \{1, \ldots, K\}$, which is the realisation of a random variable C
determined by a supervisor. A classifier produces an assessment value or score
$s_{method}(C = k \mid x_i)$ for every class $k \in C$ and assigns the observation to
the class with the highest assessment value. Some classification methods generate
assessment values $P_{method}(C = k \mid x_i)$ which are regarded as probabilities
that represent the assessment uncertainty. It is desirable to compute this kind of
probability, because such probabilities are useful in cost-sensitive decisions and
for the comparison of results from different classifiers.

To generate assessment values of any kind, margin-based classifiers need to reduce
multi-class tasks to several binary classification problems. Allwein et al. (2000)
generalize the common methods for reducing multi-class into B binary problems, such
as the one-against-rest and the all-pairs approach, by using so-called
error-correcting output coding (ECOC) matrices. The way classes are considered in a
particular binary task $b \in \{1, \ldots, B\}$ is incorporated into a code matrix
$\Psi$ with K rows and B columns. Each column vector $\psi_b$ determines with its
elements $\psi_{k,b} \in \{-1, 0, +1\}$ the classes for the b-th classification task.
A value of $\psi_{k,b} = 0$ implies that observations of the respective class k are
ignored in the current task b, while $-1$ and $+1$ determine whether a class k is
regarded as the negative or the positive class, respectively.

One-against-rest approach 

In the one-against-rest approach the number of binary classification tasks B is
equal to the number of classes K. Each class is considered once as positive while
all the remaining classes are labeled as negative. Hence, the resulting code matrix
$\Psi$ is of size K × K, displaying $+1$ on the diagonal while all other elements
are $-1$.

All-pairs approach 

In the all-pairs approach one learns for every single pair of classes a binary task
b in which one class is considered as positive and the other one as negative.
Observations which do not belong to either of these classes are omitted in the
learning of this binary task. Thus, $\Psi$ is a $K \times \binom{K}{2}$ matrix with
each column b consisting of elements $\psi_{k_1,b} = +1$ and $\psi_{k_2,b} = -1$
corresponding to a distinct class pair $(k_1, k_2)$ while all the remaining elements
are 0.
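
Both code matrices are easy to generate. The sketch below builds them as nested lists with K rows (classes) and B columns (binary tasks); classes are indexed from 0 and the column order of the all-pairs matrix follows `itertools.combinations`, both of which are conventions chosen only for this example.

```python
from itertools import combinations

def one_against_rest(K):
    """K x K code matrix: +1 on the diagonal, -1 elsewhere."""
    return [[1 if k == b else -1 for b in range(K)] for k in range(K)]

def all_pairs(K):
    """K x K(K-1)/2 code matrix: one column per class pair (k1, k2)."""
    pairs = list(combinations(range(K), 2))
    matrix = [[0] * len(pairs) for _ in range(K)]
    for b, (k1, k2) in enumerate(pairs):
        matrix[k1][b] = 1     # positive class of task b
        matrix[k2][b] = -1    # negative class of task b
    return matrix
```

For K = 3 both matrices have three columns; the number of all-pairs tasks grows quadratically with K.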



3 Coupling probability estimates 

As described before, the reduction approaches apply a classification procedure to
each column $\psi_b$ of the code matrix $\Psi$, i.e. to each binary task b. Thus,
the output of the reduction approach consists of B score vectors $s_{+1,b}(x_i)$ for
the associated positive class. To each set of scores separately, one of the
univariate calibration methods described in Gebel and Weihs (2007) can be applied.
The outcome is a calibrated assessment probability $p_{+1,b}(x_i)$ which reflects
the probabilistic confidence in assigning observation $x_i$ for task b to the set of
positive classes $\mathcal{K}_{b,+} := \{k;\ \psi_{k,b} = +1\}$ as opposed to the set
of negative classes $\mathcal{K}_{b,-} := \{k;\ \psi_{k,b} = -1\}$. Hence, this
calibrated assessment probability can be regarded as a function of the assessment
probabilities involved in the current task:

    $p_{+1,b}(x_i) = \frac{\sum_{k \in \mathcal{K}_{b,+}} P(C = k \mid x_i)}{\sum_{k \in \mathcal{K}_{b,+} \cup \mathcal{K}_{b,-}} P(C = k \mid x_i)}$.    (1)



The values $P(C = k \mid x_i)$ solving equation (1) would be the assessment
probabilities that reflect the assessment uncertainty. However, considering the
additional constraint on assessment probabilities

    $\sum_{k=1}^{K} P(C = k \mid x_i) = 1$    (2)

there exist only $K - 1$ free parameters $P(C = k \mid x_i)$ but at least K
equations for the one-against-rest approach and even more for all-pairs
($K(K-1)/2$). Since the number of free parameters is always smaller than the number
of constraints, no unique solution for the calculation of assessment probabilities
is possible and an approximate solution has to be found instead. Therefore, Hastie
and Tibshirani (1998) supply the coupling algorithm, which finds the estimated
conditional probabilities $\hat{p}_{+1,b}(x_i)$ as realizations of a Binomial
distributed random variable with an expected value $\mu_{b,i}$ in a way that

* $\hat{p}_{+1,b}(x_i)$ generate unique assessment probabilities $\hat{P}(C = k \mid x_i)$,

* $\hat{P}(C = k \mid x_i)$ meet the probability constraint (2) and

* $\hat{p}_{+1,b}(x_i)$ have minimal Kullback-Leibler divergence to the observed $p_{+1,b}(x_i)$.
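
For the all-pairs case, the iterative scheme usually associated with pairwise coupling can be sketched as below. This is an illustrative version with equal weights for all pairs, not a transcription of Hastie and Tibshirani (1998): `r[i][j]` is assumed to hold the calibrated estimate of P(C = i | C in {i, j}, x) with values strictly between 0 and 1, and the stopping parameters are arbitrary.

```python
from typing import List

def couple(r: List[List[float]], max_sweeps: int = 1000, tol: float = 1e-12) -> List[float]:
    """Fit class probabilities p so that p[i] / (p[i] + p[j]) approximates r[i][j]."""
    K = len(r)
    p = [1.0 / K] * K                      # start from the uniform distribution
    for _ in range(max_sweeps):
        previous = list(p)
        for i in range(K):
            others = [j for j in range(K) if j != i]
            observed = sum(r[i][j] for j in others)
            fitted = sum(p[i] / (p[i] + p[j]) for j in others)
            p[i] *= observed / fitted      # multiplicative update for class i
            total = sum(p)
            p = [v / total for v in p]     # keep the probabilities summing to one
        if max(abs(u - v) for u, v in zip(p, previous)) < tol:
            break
    return p
```

If the pairwise estimates are mutually consistent, i.e. generated by one underlying probability vector, the iteration recovers that vector; otherwise it returns a compromise that satisfies constraint (2).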



4 Dirichlet calibration 

The idea underlying the following multivariate calibration method is to transform
the combined binary classification task outputs into realizations of a Dirichlet
distributed random vector $P \sim \mathcal{D}(h_1, \ldots, h_K)$ and regard the
elements as assessment probabilities $P_k := P(C = k \mid x)$.

Following the concept of well-calibration by DeGroot and Fienberg (1983), we want to
achieve that the confidence in the assignment to a particular class converges to the
probability for this class. This requirement can easily be attained with a Dirichlet
distributed random vector by choosing parameters $h_k$ proportional to the a-priori
probabilities $\pi_k$ of the classes, since the elements $P_k$ have expected values
$E(P_k) = h_k / \sum_{j=1}^{K} h_j$.

Dirichlet distribution

A random vector $P = (P_1, \ldots, P_K)'$ generated by

    $P_k = \frac{S_k}{\sum_{j=1}^{K} S_j}$   ($k = 1, \ldots, K$)

with K independently $\chi^2$-distributed random variables $S_k \sim \chi^2(2 \cdot h_k)$
is Dirichlet distributed with parameters $h_1, \ldots, h_K$, see Johnson et al. (2002).
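
This construction can be checked numerically. The sketch below draws the chi-square variables through the gamma distribution, using the fact that a chi-square variable with 2h degrees of freedom is a gamma variable with shape h and scale 2; the function name and the use of Python's `random` module are choices made for this example.

```python
import random

def dirichlet_sample(h, rng):
    """Draw one Dirichlet(h_1, ..., h_K) vector by normalizing chi-square variables."""
    # chi-square with 2 * h_k degrees of freedom == Gamma(shape=h_k, scale=2)
    s = [rng.gammavariate(h_k, 2.0) for h_k in h]
    total = sum(s)
    return [value / total for value in s]
```

Averaging many draws should give component means close to h_k divided by the sum of all h_j, which is the property used to make the calibrated probabilities match the class priors.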

Dirichlet calibration 

Initially, instead of applying a univariate calibration method we normalize the
output vectors $s_{i,+1,b}$ by dividing them by their range and adding one half, so
that boundary values (s = 0) lead to boundary probabilities (p = 0.5):

    $p_{i,+1,b} := \frac{s_{i,+1,b} + \rho \cdot \max_i |s_{i,+1,b}|}{2 \cdot \rho \cdot \max_i |s_{i,+1,b}|}$,    (3)

since the doubled maximum of absolute values of scores is the range of scores. It is
required to use a smoothing factor $\rho = 1.05$ in (3) so that
$p_{i,+1,b} \in\ ]0, 1[$, since we calculate in the following the geometric mean of
associated binary proportions for each class $k \in \{1, \ldots, K\}$:

    $r_{i,k} := \left[\prod_{b:\psi_{k,b}=+1} p_{i,+1,b} \cdot \prod_{b:\psi_{k,b}=-1} (1 - p_{i,+1,b})\right]^{1/\#\{b:\ \psi_{k,b} \ne 0\}}$.
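
The two steps above, rescaling the scores of one binary task and taking the geometric mean over the tasks a class takes part in, can be sketched as follows. The function names are illustrative, and the exponent of the geometric mean is taken to be one over the number of tasks with a non-zero code for the class, which is an assumption of this example.

```python
def normalise_scores(scores, rho=1.05):
    """Map the scores of one binary task into the open interval ]0, 1[."""
    largest = max(abs(s) for s in scores)   # assumes at least one non-zero score
    return [(s + rho * largest) / (2 * rho * largest) for s in scores]

def class_confidence(p_row, psi_k):
    """Geometric mean of the binary proportions of the tasks involving class k.

    p_row[b] is the normalized proportion for the positive class of task b,
    psi_k[b] is the code of class k in task b (+1, -1 or 0).
    """
    factors = [p if code == 1 else 1.0 - p
               for p, code in zip(p_row, psi_k) if code != 0]
    product = 1.0
    for factor in factors:
        product *= factor
    return product ** (1.0 / len(factors))
```

A score of 0 is mapped to 0.5, and because of the factor rho = 1.05 even the most extreme score stays strictly inside ]0, 1[, so the product never collapses to exactly 0 or 1.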



This mean confidence is regarded as a realization of a Beta distributed random
variable $R_k \sim \mathcal{B}(\alpha_k, \beta_k)$, and the parameters $\alpha_k$ and
$\beta_k$ are estimated from the training set by the method of moments. We prefer
the geometric to the arithmetic mean of proportions, since the product is well
applicable for proportions, especially when they are skewed. Skewed proportions are
likely to occur when using the one-against-rest approach in situations with high
class numbers, since here the negative class observations strongly outnumber the
positive ones.

To derive a multivariate Dirichlet distributed random vector, the $r_{i,k}$ can be
transformed to realizations of a uniformly distributed random variable

    $u_{i,k} := F_{\mathcal{B},\alpha_k,\beta_k}(r_{i,k})$.

By using the inverse of the $\chi^2$-distribution function these uniformly
distributed random variables are further transformed into $\chi^2$-distributed
random variables. The realizations of a Dirichlet distributed random vector
$P \sim \mathcal{D}(h_1, \ldots, h_K)$ with elements

    $\hat{p}_{i,k} := \frac{F^{-1}_{\chi^2(2 h_k)}(u_{i,k})}{\sum_{j=1}^{K} F^{-1}_{\chi^2(2 h_j)}(u_{i,j})}$

are achieved by normalizing. New parameters $h_1, \ldots, h_K$ should be chosen
proportional to the frequencies $\pi_1, \ldots, \pi_K$ of the particular classes. In
the optimization procedure we choose the factor $m = 1, 2, \ldots, 2 \cdot N$ with
respective parameters $h_k = m \cdot \pi_k$ which scores highest on the training set
in terms of performance, determined by the geometric mean of measures (4), (5) and
(6).



5 Comparison 

This section supplies a comparison of the presented calibration methods based on
their performance. Naturally, the precision of a classification method is the major
characteristic of its performance. However, comparing classification and calibration
methods on the basis of precision alone results in a loss of information and would
not include all requirements a probabilistic classifier score has to fulfill. To
overcome this problem, calibrated probabilities should satisfy the two additional
axioms:

• Effectiveness in the assignment and 

• Well-calibration in the sense of DeGroot and Fienberg (1983). 

Precision 

The correctness rate

    $Cr := \frac{1}{N}\sum_{i=1}^{N} I_{[\hat{c}(x_i) = c_i]}(x_i)$,    (4)

where I is the indicator function, is the key performance measure in classification,
since it mirrors the quality of the assignment to classes.

Effective assignment 

Assessment probabilities should be effective in their assignment, i.e. moderately
high for true classes and small for false classes. An indicator for such an
effectiveness is the complement of the Root Mean Squared Error:

    $1 - RMSE := 1 - \frac{1}{N}\sum_{i=1}^{N} \sqrt{\frac{1}{K}\sum_{k=1}^{K} \left[I_{[c_i = k]}(x_i) - P(C = k \mid x_i)\right]^2}$.    (5)



Well-calibrated probabilities 

DeGroot and Fienberg (1983) give the following definition of a well-calibrated
forecast: "If we forecast an event with probability p, it should occur with a
relative frequency of about p." To transfer this requirement from forecasting to
classification we partition the training/test set according to the assignment to
classes into K groups $T_k := \{(c_i, x_i) \in T : \hat{c}(x_i) = k\}$ with
$N_{T_k} := |T_k|$ observations. Thus, in a partition $T_k$ the forecast is class k.

Predicted classes can differ from true classes and the remaining classes $j \ne k$
can actually occur in a partition $T_k$. Therefore, we estimate the average
confidence $Cf_{k,j} := \frac{1}{N_{T_k}}\sum_{x_i \in T_k} P(C = j \mid \hat{c}(x_i) = k)$
for every class j in a partition $T_k$. According to DeGroot and Fienberg (1983)
this confidence should converge to the average correctness
$Cr_{k,j} := \frac{1}{N_{T_k}}\sum_{x_i \in T_k} I_{[c_i = j]}$. The average closeness
of these two measures

    $WCR := 1 - \frac{1}{K^2}\sum_{k=1}^{K}\sum_{j=1}^{K} \left|Cf_{k,j} - Cr_{k,j}\right|$    (6)

indicates how well-calibrated the assessment probabilities are.

On the one hand, the minimizing "probabilities" for the RMSE (5) can be just the
class indicators, especially if overfitting occurs in the training set. On the other hand,
vectors of the individual correctness values maximize the WCR (6). To overcome
these drawbacks, it is convenient to combine the two calibration measures by their
geometric mean to the calibration measure

$$Cal := \sqrt{(1 - RMSE) \cdot WCR}$$ (7)
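As an illustration of how the measures (4) to (7) fit together, the following Python sketch computes them from a matrix of calibrated probabilities. It is not the authors' implementation: the function name `calibration_measures` is hypothetical, the predicted class is assumed to be the argmax of the probability vector, the average confidence is read as the mean probability assigned to class j within partition T_k, the average correctness as the share of observations in T_k whose true class is j, and empty partitions are assumed to contribute no deviation.

```python
import math


def calibration_measures(probs, true_labels):
    """Return (Cr, 1 - RMSE, WCR, Cal) for calibrated probabilities.

    probs[i][k] is the probability of class k for observation i and
    true_labels[i] is the true class in 0..K-1 (both are assumptions
    about the data layout, not taken from the paper).
    """
    n, k_classes = len(probs), len(probs[0])
    # Assumed assignment rule: predicted class = class with highest probability.
    pred = [max(range(k_classes), key=lambda k: p[k]) for p in probs]

    # Correctness rate, Eq. (4).
    cr = sum(1 for c_hat, c in zip(pred, true_labels) if c_hat == c) / n

    # Mean over observations of the root mean squared deviation, Eq. (5).
    rmse = sum(
        math.sqrt(
            sum(((1.0 if c == k else 0.0) - p[k]) ** 2 for k in range(k_classes))
            / k_classes
        )
        for p, c in zip(probs, true_labels)
    ) / n

    # Well-calibration ratio, Eq. (6), per partition T_k of predicted class k.
    gap = 0.0
    for k in range(k_classes):
        members = [i for i in range(n) if pred[i] == k]
        if not members:
            continue  # assumption: an empty partition adds no deviation
        for j in range(k_classes):
            confidence = sum(probs[i][j] for i in members) / len(members)
            correctness = sum(1 for i in members if true_labels[i] == j) / len(members)
            gap += abs(confidence - correctness)
    wcr = 1.0 - gap / k_classes ** 2

    cal = math.sqrt((1.0 - rmse) * wcr)  # geometric mean, Eq. (7)
    return cr, 1.0 - rmse, wcr, cal


print(calibration_measures([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], [0, 1, 1]))
```

Class indicators as "probabilities" give Cal = 1 on the data they were fitted to, which is the overfitting case mentioned above; uninformative probabilities are penalized through the 1 - RMSE factor.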



Experiments 

The following experiments are based on the two three-class data sets Iris and
balance-scale from the UCI ML-Repository as well as the four-class data set B3,
see Newman et al. (1998) and Heilemann and Münch (1996), respectively.

Recent analyses of risk minimization show that minimizing a risk based on
the hinge loss, which is usually used in SVM, leads to scores without any probability
information, see Zhang (2004). Hence, the L2-SVM, see Suykens and Vandewalle
(1999), which uses the quadratic hinge loss function and thus squared slack variables,
is preferred to the standard SVM. For classification we used the L2-SVM with
radial-basis kernel function and a Neural Network with one hidden layer, both with the
one-against-rest and the all-pairs approach. In every binary decision a separate
3-fold cross-validation grid search was used to find optimal parameters.

The results of the analyses with 10-fold cross-validation for calibrating L2-SVM
and ANN are presented in Tables 1 and 2, respectively.

Table 1 shows that for L2-SVM no overall best calibration method is available. For
the Iris data set all-pairs with mapping outperforms the other methods, while for B3
the Dirichlet calibration and the all-pairs method without any calibration perform
best. For the balance-scale data set, the calibrators hardly differ in correctness.

However, comparing these results to the ones for ANN in Table 2 shows that the
ANN, except for the all-pairs method with no calibration, yields better results for all
data sets.

Here, the one-against-rest method with the Dirichlet calibrator outperforms
all other methods for Iris and B3. Considering Cr and Cal for balance-scale,

Table 1. Results for calibrating L2-SVM-scores

                          Iris            B3              balance
Method                    Cr      Cal     Cr      Cal     Cr      Cal
all-pairs, no             0.853   0.497   0.720   0.536   0.877   0.486
all-pairs, map            0.940   0.765   0.688   0.656   0.886   0.859
all-pairs, assign         0.927   0.761   0.694   0.677   0.886   0.832
all-pairs, Dirichlet      0.893   0.755   0.720   0.688   0.888   0.771
1-v-rest, no              0.833   0.539   0.688   0.570   0.885   0.464
1-v-rest, map             0.873   0.647   0.682   0.563   0.878   0.784
1-v-rest, assign          0.867   0.690   0.701   0.605   0.885   0.830
1-v-rest, Dirichlet       0.880   0.767   0.726   0.714   0.880   0.773

Table 2. Results for calibrating ANN-scores

                          Iris            B3              balance
Method                    Cr      Cal     Cr      Cal     Cr      Cal
all-pairs, no             0.667   0.614   0.490   0.573   0.302   0.414
all-pairs, map            0.973   0.909   0.752   0.756   0.970   0.946
all-pairs, assign         0.960   0.840   0.771   0.756   0.954   0.886
all-pairs, Dirichlet      0.953   0.892   0.777   0.739   0.851   0.619
1-v-rest, no              0.973   0.618   0.803   0.646   0.981   0.588
1-v-rest, map             0.973   0.942   0.803   0.785   0.978   0.921
1-v-rest, assign          0.973   0.896   0.796   0.752   0.976   0.829
1-v-rest, Dirichlet       0.973   0.963   0.815   0.809   0.971   0.952

Table 3. Comparing to direct classification methods

                              Iris            B3              balance
Method                        Cr      Cal     Cr      Cal     Cr      Cal
ANN, 1-v-rest, Dirichlet      0.973   0.963   0.815   0.809   0.971   0.952
LDA                           0.980   0.972   0.713   0.737   0.862   0.835
QDA                           0.980   0.969   0.771   0.761   0.914   0.866
Logistic Regression           0.973   0.964   0.561   0.633   0.843   0.572
tree                          0.927   0.821   0.427   0.556   0.746   0.664
Naive Bayes                   0.947   0.936   0.650   0.668   0.904   0.710

one-against-rest with mapping performs best, but with correctness just slightly
better than the Dirichlet calibrator.

Finally, the comparison of the one-against-rest ANN with Dirichlet calibration to
other direct classification methods in Table 3 shows that for Iris LDA and QDA are
the best classifiers, since the Iris variables are more or less multivariate normally
distributed. For the two other data sets the ANN yields the highest performance.


6 Conclusion

In conclusion, calibration of binary classification outputs is beneficial
in most cases, especially for an ANN with the all-pairs algorithm.

Comparing classification methods to each other, one can see that the ANN with
one-against-rest and Dirichlet calibration performs better than other classifiers, except
LDA and QDA on Iris. Thus, the Dirichlet calibration is a well-performing alternative,
especially for ANN. The Dirichlet calibration yields better results with
one-against-rest, since combining outputs by their geometric mean is better
applicable in this case, where outputs are all based on the same binary decisions.
Furthermore, the Dirichlet calibration has the advantage that only one optimization
procedure has to be computed instead of the two steps for coupling with an
incorporated univariate calibration of binary outputs.


Abstract. Kernel methods offer a flexible toolbox for pattern analysis and machine
learning. A general class of kernel functions which incorporates known pattern invariances
are invariant distance substitution (IDS) kernels. Instances such as tangent distance or
dynamic time-warping kernels have demonstrated the real world applicability. This
motivates the demand for investigating the elementary properties of the general IDS-kernels.
In this paper we formally state and demonstrate their invariance properties, in particular
the adjustability of the invariance in two conceptually different ways. We characterize
the definiteness of the kernels. We apply the kernels in different classification methods,
which demonstrates various benefits of invariance.



1 Introduction 

Kernel methods have gained large popularity in the pattern recognition and machine
learning communities due to the modularity of the algorithms and the data
representations by kernel functions, cf. (Schölkopf and Smola (2002)) and (Shawe-Taylor
and Cristianini (2004)). It is well known that prior knowledge of a problem at hand
must be incorporated in the solution to improve the generalization results. We
address a general class of kernel functions called IDS-kernels (Haasdonk and Burkhardt
(2007)) which incorporates prior knowledge given by pattern invariances.

The contribution of the current study is a detailed formalization of their basic
properties. We both formally characterize and illustratively demonstrate their
adjustable invariance properties in Sec. 3. We formalize the definiteness properties in
detail in Sec. 4. The wide applicability of the kernels is demonstrated in different
classification methods in Sec. 5.


2 Background

Kernel methods are general nonlinear analysis methods such as the kernel
principal component analysis, support vector machine, kernel perceptron, kernel Fisher
discriminant, etc. (Schölkopf and Smola (2002)) and (Shawe-Taylor and Cristianini
(2004)). The main ingredient in these methods is the kernel as a similarity measure
between pairs of patterns from the set $X$.

Definition 1 (Kernel, definiteness). A function $k : X \times X \to \mathbb{R}$ which is symmetric
is called a kernel. A kernel $k$ is called positive definite (pd), if for all $n$ and all sets of
observations $(x_i)_{i=1}^{n} \in X^n$ the kernel matrix $K := (k(x_i, x_j))_{i,j=1}^{n}$ satisfies
$v^T K v \geq 0$ for all $v \in \mathbb{R}^n$. If this only holds for all $v$ satisfying $v^T 1 = 0$, the
kernel is called conditionally positive definite (cpd).

We denote some particular $\ell^2$-inner-product $\langle \cdot, \cdot \rangle$ and $\ell^2$-distance
$\| \cdot - \cdot \|$ based kernels by

$k^{lin}(x,x') := \langle x, x' \rangle$,
$k^{nd}(x,x') := -\|x - x'\|^{\beta}$ for $\beta \in [0,2]$,
$k^{pol}(x,x') := (1 + \gamma \langle x, x' \rangle)^p$,
$k^{rbf}(x,x') := e^{-\gamma \|x - x'\|^2}$ for $p \in \mathbb{N}$, $\gamma \in \mathbb{R}_+$.

Here, the linear $k^{lin}$, polynomial $k^{pol}$ and Gaussian radial basis function (rbf)
$k^{rbf}$ are pd for the given parameter ranges. The negative distance kernel $k^{nd}$ is
cpd (Schölkopf and Smola (2002)).
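The four base kernels and the difference between pd and cpd can be sketched in a few lines of Python. This is an illustration only: the helper names (`inner`, `dist`, `kernel_matrix`, `quadratic_form`) and the three sample points are assumptions made for the example. The quadratic form of the negative distance kernel is evaluated once for a vector whose entries sum to zero, as required in the cpd condition, and once for the all-ones vector.

```python
import math


def inner(x, y):
    """Euclidean inner product <x, y>."""
    return sum(a * b for a, b in zip(x, y))


def dist(x, y):
    """Euclidean distance ||x - y||."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def k_lin(x, y):
    return inner(x, y)


def k_nd(x, y, beta=1.0):
    return -(dist(x, y) ** beta)  # cpd for beta in [0, 2]


def k_pol(x, y, p=2, gamma=1.0):
    return (1.0 + gamma * inner(x, y)) ** p


def k_rbf(x, y, gamma=1.0):
    return math.exp(-gamma * dist(x, y) ** 2)


def kernel_matrix(points, kernel):
    return [[kernel(a, b) for b in points] for a in points]


def quadratic_form(K, v):
    """v^T K v for a kernel matrix K given as nested lists."""
    n = len(v)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]  # arbitrary sample points
K_nd = kernel_matrix(points, k_nd)
print(quadratic_form(K_nd, (1.0, -2.0, 1.0)))  # entries sum to zero: nonnegative
print(quadratic_form(K_nd, (1.0, 1.0, 1.0)))   # unrestricted vector: negative
```

The second value is negative, so the negative distance kernel is not pd on these points, while the first is consistent with the cpd property.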
We continue with formalizing the prior knowledge about pattern variations and cor- 
responding notation: 

Definition 2 (Transformation knowledge). We assume to have transformation
knowledge for a given task, i.e. the knowledge of a set $T = \{t : X \to X\}$ of
transformations of the object space including the identity, i.e. $id \in T$. We denote the set
of transformed patterns of $x \in X$ as $T_x := \{t(x) \mid t \in T\}$ which are assumed to have
identical or similar inherent meaning as $x$.

The set of concatenations of transformations from two sets $T, T'$ is denoted as
$T \circ T'$. The $n$-fold concatenation of transformations $t$ is denoted as $t^{n+1} := t \circ t^n$,
the corresponding sets are denoted as $T^{n+1} := T \circ T^n$. If all $t \in T$ are invertible, we
denote the set of inverted functions as $T^{-1}$. We denote the semigroup of transformations
generated by $T$ as $\bar{T} := \bigcup_{n \in \mathbb{N}} T^n$. The set $\bar{T}$ induces an equivalence relation
on $X$ by $x \sim x' :\Leftrightarrow$ there exist $\bar{t}, \bar{t}' \in \bar{T}$ such that $\bar{t}(x) = \bar{t}'(x')$. The
equivalence class of $x$ is denoted with $E_x$ and the set of all equivalence sets is $X/\sim$.

Learning targets can often be modeled as functions of several input objects, for 
instance depending on the training data and the data for which predictions are re- 
quired. We define the desired notion of invariance: 

Definition 3 (Total Invariance). We call a function $f : X^n \to \mathcal{H}$ totally invariant
with respect to $T$, if for all patterns $x_1, \ldots, x_n \in X$ and transformations
$t_1, \ldots, t_n \in T$ holds $f(x_1, \ldots, x_n) = f(t_1(x_1), \ldots, t_n(x_n))$.

As the IDS-kernels are based on distances, we define:

Definition 4 (Distance, Hilbertian Metric). A function $d : X \times X \to \mathbb{R}$ is called a
distance, if it is symmetric and nonnegative and has zero diagonal, i.e. $d(x,x) = 0$.
A distance is a Hilbertian metric if there exists an embedding into a Hilbert space
$\Phi : X \to \mathcal{H}$ such that $d(x,x') = \|\Phi(x) - \Phi(x')\|$.

So in particular the triangle inequality does not need to be valid for a distance
function in this sense. Note also that a Hilbertian metric can still allow $d(x,x') = 0$
for $x \neq x'$.

Assuming some distance function $d$ on the space of patterns $X$ enables us to
incorporate the invariance knowledge given by the transformations $T$ into a new
dissimilarity measure.

Definition 5 (Two-Sided invariant distance). For a given distance $d$ on the set $X$
and some cost function $\Omega : T \times T \to \mathbb{R}_+$ with $\Omega(t,t') = 0 \Leftrightarrow t = t' = id$, we define
the two-sided invariant distance as

$$d_{2S}(x,x') := \inf_{t,t' \in T} d(t(x), t'(x')) + \lambda \Omega(t,t').$$ (1)

For $\lambda = 0$ the distance is called unregularized. In the following we exclude
artificial degenerate cases and reasonably assume that $\lim_{\lambda \to \infty} d_{2S}(x,x') = d(x,x')$ for all
$x, x'$. The requirement of precise invariance is often too strict for practical problems.
The points within $T_x$ are sometimes not to be regarded as identical to $x$, but only as
similar, where the similarity can even vary over $T_x$. An intuitive example is optical
character recognition, where the similarity of a letter and its rotated version
decreases with growing rotation angle. This approximate invariance can be realized
with IDS-kernels by choosing $\lambda > 0$.

With the notion of invariant distance we define the invariant distance substitution 
kernels as follows: 

Definition 6 (IDS-Kernels). For a distance-based kernel, i.e. $k(x,x') = f(\|x - x'\|)$,
and the invariant distance measure $d_{2S}$ we call $k_{IDS}(x,x') := f(d_{2S}(x,x'))$ its
invariant distance substitution kernel (IDS-kernel). Similarly, for an inner-product-based
kernel $k$, i.e. $k(x,x') = f(\langle x, x' \rangle)$, we call $k_{IDS}(x,x') := f(\langle x, x' \rangle_O)$ its IDS-kernel,
where $O \in X$ is an arbitrary origin and a generalization of the inner product is given
by $\langle x, x' \rangle_O := -\frac{1}{2} \left( d_{2S}(x,x')^2 - d_{2S}(x,O)^2 - d_{2S}(x',O)^2 \right)$.
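To make Definitions 5 and 6 concrete, the following Python sketch evaluates the two-sided invariant distance by brute force over a finite transformation set and substitutes it into an rbf kernel and into the generalized inner product. It is a sketch under stated assumptions, not the computation used in the paper: the infimum is replaced by a minimum over a finite list, transformations are represented as hypothetical `(function, cost)` pairs, the cost is taken as $\Omega(t,t') = c(t) + c(t')$ with $c(id) = 0$, and `shift_family` is an illustrative helper that builds shifts along a fixed direction.

```python
import math


def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def shift_family(direction, steps):
    """Finite set T of shifts p -> p + s * direction; assumed cost c(t) = |s|."""
    return [
        (lambda p, s=s: tuple(a + s * b for a, b in zip(p, direction)), abs(s))
        for s in steps
    ]


def d_2s(x, y, transforms, lam=0.0):
    """Two-sided invariant distance, Eq. (1), minimized over a finite T."""
    return min(
        euclid(t(x), u(y)) + lam * (ct + cu)
        for t, ct in transforms
        for u, cu in transforms
    )


def k_rbf_ids(x, y, transforms, gamma=1.0, lam=0.0):
    """Distance-based IDS-kernel: rbf with d_2S substituted for the distance."""
    return math.exp(-gamma * d_2s(x, y, transforms, lam) ** 2)


def inner_ids(x, y, origin, transforms, lam=0.0):
    """Generalized inner product <x, y>_O built from d_2S."""
    d = lambda a, b: d_2s(a, b, transforms, lam)
    return -0.5 * (d(x, y) ** 2 - d(x, origin) ** 2 - d(y, origin) ** 2)


T = shift_family((1.0, 0.0), [-1.0, -0.5, 0.0, 0.5, 1.0])  # includes the identity (s = 0)
print(d_2s((0.0, 0.0), (1.5, 1.0), T))       # horizontal offset is absorbed by the shifts
print(k_rbf_ids((0.0, 0.0), (2.0, 0.0), T))  # both points can be shifted onto (1, 0)
```

With the transformation set reduced to the identity, `inner_ids` reduces to the ordinary inner product relative to the origin, which is the non-invariant case.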

The IDS-kernels capture existing approaches such as tangent distance or dynamic 
time-warping kernels which indicates the real world applicability, cf. (Haasdonk 
(2005)) and (Haasdonk and Burkhardt (2007)) and the references therein. 

Crucial for efficient computation of the kernels is to avoid explicit pattern
transformations by using or assuming some additional structure on $T$. An important
computational benefit of the IDS-kernels is the possibility to precompute the distance
matrices. By this, the final kernel evaluation is very cheap and ordinary fast model
selection by varying kernel or training parameters can be performed.


3 Adjustable invariance

As first elementary property, we address the invariance. The IDS-kernels offer two
possibilities for controlling the transformation extent and thereby interpolating
between the invariant and non-invariant case. Firstly, the size of $T$ can be adjusted.
Secondly, the regularization parameter $\lambda$ can be increased to reduce the invariance.
This is summarized in the following:

Proposition 1 (Invariance of IDS-Kernels).

i) If $T = \{id\}$ and $d$ is an arbitrary distance, then $k_{IDS} = k$.

ii) If all $t \in T$ are invertible, then distance-based unregularized IDS-kernels
$k_{IDS}(\cdot, x)$ are constant on $(T^{-1} \circ T)_x$.

iii) If $T = \bar{T}$ and $\bar{T}^{-1} = \bar{T}$, then unregularized IDS-kernels are totally invariant
with respect to $\bar{T}$.

iv) If $d$ is the ordinary Euclidean distance, then $\lim_{\lambda \to \infty} k_{IDS} = k$.

Proof. Statement i) is obvious from the definition, as $d_{2S} = d$ in this case.
Similarly, iv) follows as $\lim_{\lambda \to \infty} d_{2S} = d$. For statement ii), we note that if
$x' \in (T^{-1} \circ T)_x$, then there exist transformations $t, t' \in T$ such that $t(x) = t'(x')$ and
consequently $d_{2S}(x,x') = 0$. So any distance-based kernel $k_{IDS}$ is constant on this set
$(T^{-1} \circ T)_x$. For proving iii) we observe that for $\bar{t}, \bar{t}' \in \bar{T}$ holds
$d_{2S}(\bar{t}(x), \bar{t}'(x')) = \inf_{t,t'} d(t(\bar{t}(x)), t'(\bar{t}'(x'))) \geq \inf_{t,t'} d(t(x), t'(x')) = d_{2S}(x,x')$.
Using the same argumentation with $\bar{t}(x)$ for $x$, $\bar{t}^{-1}$ for $\bar{t}$ and similar replacements
for $x', \bar{t}'$ yields $d_{2S}(x,x') \geq d_{2S}(\bar{t}(x), \bar{t}'(x'))$, which gives the total invariance of
$d_{2S}$ and thus for all unregularized IDS-kernels.

Points i) to iii) imply that the invariance can be adjusted by the size of $T$. Point ii)
implies that the invariance occasionally exceeds the set $T_x$. If for instance $T$ is closed
with respect to inversions, i.e. $T = T^{-1}$, then the set of constant values is $(T^2)_x$. Points
iii) and iv) indicate that $\lambda$ can be used to interpolate between the full invariant and
non-invariant case.
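Both adjustment mechanisms can be observed numerically on the real line. The Python sketch below is an illustration under assumptions that are not taken from the paper: patterns are real numbers, $T$ is a finite set of shifts $x \mapsto x + s$, the base distance is $|x - y|$, and the cost is $\Omega(t,t') = |s| + |s'|$. The first sweep increases $\lambda$ for a fixed $T$, the second enlarges $T$ for $\lambda = 0$.

```python
def d_2s(x, y, shifts, lam=0.0):
    """Two-sided invariant distance on the real line for a finite set of shifts."""
    return min(
        abs((x + s) - (y + u)) + lam * (abs(s) + abs(u))
        for s in shifts
        for u in shifts
    )


x, y = 0.0, 1.0  # base distance d(x, y) = 1

# Mechanism 1: increase the regularization parameter for a fixed T.
shifts = (-1.0, -0.5, 0.0, 0.5, 1.0)
lambda_sweep = [d_2s(x, y, shifts, lam) for lam in (0.0, 0.5, 1.0, 2.0)]

# Mechanism 2: enlarge the transformation set for lambda = 0.
extent_sweep = [
    d_2s(x, y, (0.0,)),
    d_2s(x, y, (-0.25, 0.0, 0.25)),
    d_2s(x, y, (-0.5, -0.25, 0.0, 0.25, 0.5)),
]

print(lambda_sweep)  # grows from the invariant value 0 to the base distance 1
print(extent_sweep)  # shrinks from the base distance 1 to the invariant value 0
```

In this toy setting the regularized distance already equals the base distance for $\lambda \geq 1$, since shifting then never pays off.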

We give simple illustrations of the proposed kernels and these adjustability
mechanisms in Fig. 1. For the illustrations, our objects are simply points in two
dimensions and several transformations define sets of points to be regarded as similar. We
fix one argument $x'$ (denoted with a black dot) of the kernel, and the other argument
$x$ is varying over the square $[-1,2]^2$ in the Euclidean plane. We plot the different
resulting kernel values $k(x,x')$ in gray-shades. All plots generated in the sequel can
be reproduced by the MATLAB library KerMet-Tools (Haasdonk (2005)).

In Fig. 1 a) we focus on a linear shift along a certain slant direction while
increasing the transformation extent, i.e. the size of $T$. The figure demonstrates the
behaviour of the linear unregularized IDS-kernel, which perfectly aligns to the
transformation direction as claimed by Prop. 1 i) to iii). It is striking that the captured
transformation range is indeed much larger than $T$ and very accurate for the
IDS-kernels as promised by Prop. 1 ii).

The second means for controlling the transformation extent, namely increasing
the regularization parameter $\lambda$, is also applicable for discrete transformations such
as reflections and even in combination with continuous transformations such as
rotations, cf. Fig. 1 b). We see that the interpolation between the invariant and
non-invariant case as claimed in Prop. 1 ii) and iv) is nicely realized. So the approach is
indeed very general concerning types of transformations, comprising discrete,
continuous, linear, nonlinear transformations and combinations thereof.

Fig. 1. Adjustable invariance of IDS-kernels. a) Linear kernel with invariance wrt. linear
shifts, adjustability by increasing transformation extent by the set $T$, $\lambda = 0$, b) kernel with
combined nonlinear and discrete transformations, adjustability by increasing regularization
parameter $\lambda$.



4 Positive definiteness 

The second elementary property of interest, the positive definiteness of the kernels,
can be characterized as follows by applying a finding from (Haasdonk and Bahlmann
(2004)):

Proposition 2 (Definiteness of Simple IDS-Kernels). The following statements are
equivalent:

i) $d_{2S}$ is a Hilbertian metric
ii) $k^{nd}_{IDS}$ is cpd for all $\beta \in [0,2]$
iii) $k^{lin}_{IDS}$ is pd
iv) $k^{rbf}_{IDS}$ is pd for all $\gamma \in \mathbb{R}_+$
v) $k^{pol}_{IDS}$ is pd for all $p \in \mathbb{N}$, $\gamma \in \mathbb{R}_+$.

So the crucial property, which determines the (c)pd-ness of IDS-kernels, is whether
$d_{2S}$ is a Hilbertian metric. A practical criterion for disproving this is a violation
of the triangle inequality. A precise characterization for $d_{2S}$ being a Hilbertian metric
is obtained from the following.
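The triangle-inequality criterion is easy to test on sample points. The following Python sketch searches for violating triples; it is only an illustration, with patterns on the real line, a finite shift set as $T$, $\lambda = 0$, and the helper name `triangle_violations` all being assumptions made for the example. With shifts of at most one unit, the points 0 and 2 can be moved onto each other, as can 2 and 4, but 0 and 4 cannot.

```python
from itertools import permutations


def d_2s(x, y, shifts):
    """Unregularized two-sided invariant distance on the real line."""
    return min(abs((x + s) - (y + u)) for s in shifts for u in shifts)


def triangle_violations(points, dist, tol=1e-12):
    """All ordered triples (a, b, c) with dist(a, c) > dist(a, b) + dist(b, c)."""
    return [
        (a, b, c)
        for a, b, c in permutations(points, 3)
        if dist(a, c) > dist(a, b) + dist(b, c) + tol
    ]


points = [0.0, 2.0, 4.0]
bounded = triangle_violations(points, lambda a, b: d_2s(a, b, (-1.0, 0.0, 1.0)))
identity_only = triangle_violations(points, lambda a, b: d_2s(a, b, (0.0,)))

print(bounded)        # violating triples: d_2S is not a Hilbertian metric here
print(identity_only)  # empty: the plain distance satisfies the triangle inequality
```

A non-empty result disproves the Hilbertian metric property for that transformation set; an empty result on a sample proves nothing in general.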

Proposition 3 (Characterization of $d_{2S}$ as Hilbertian Metric). The unregularized
$d_{2S}$ is a Hilbertian metric if and only if $d_{2S}$ is totally invariant with respect to $\bar{T}$ and
$d_{2S}$ induces a Hilbertian metric on $X/\sim$.

Proof. Let $d_{2S}$ be a Hilbertian metric, i.e. $d_{2S}(x,x') = \|\Phi(x) - \Phi(x')\|$. For
proving the total invariance wrt. $\bar{T}$ it is sufficient to prove the total invariance wrt. $T$
due to transitivity. Assuming that for some choice of patterns/transformations holds
$d_{2S}(x,x') \neq d_{2S}(t(x), t'(x'))$, a contradiction can be derived: Note that $d_{2S}(t(x),x')$
differs from one of both sides of the inequality, without loss of generality the left
one, and assume $d_{2S}(x,x') < d_{2S}(t(x),x')$. The definition of the two-sided distance
implies $d_{2S}(x,t(x)) = \inf_{t',t''} d(t'(x), t''(t(x))) = 0$ via $t' := t$ and $t'' = id$. By the
triangle inequality, this gives the desired contradiction
$d_{2S}(x,x') < d_{2S}(t(x),x') \leq d_{2S}(t(x),x) + d_{2S}(x,x') = 0 + d_{2S}(x,x')$.
Based on the total invariance, $d_{2S}(\cdot,x'')$ is constant on each $E \in X/\sim$: For all
$x \sim x'$ transformations $\bar{t}, \bar{t}'$ exist such that $\bar{t}(x) = \bar{t}'(x')$. So we have
$d_{2S}(x,x'') = d_{2S}(\bar{t}(x),x'') = d_{2S}(\bar{t}'(x'),x'') = d_{2S}(x',x'')$, i.e.
this induces a well defined function on $X/\sim$ by $\bar{d}_{2S}(E,E') := d_{2S}(x(E),x(E'))$. Here
$x(E)$ denotes one representative from the equivalence class $E \in X/\sim$. Obviously, $\bar{d}_{2S}$
is a Hilbertian metric, via $\bar{\Phi}(E) := \Phi(x(E))$. The reverse direction of the proposition
is clear by choosing $\Phi(x) := \bar{\Phi}(E_x)$.

Precise statements for or against pd-ness can be derived, which are solely based on
properties of the underlying $T$ and base distance $d$:

Proposition 4 (Characterization by $d$ and $T$).

i) If $T$ is too small compared to $\bar{T}$ in the sense that there exists $x' \in \bar{T}_x$, but
$d(T_x, T_{x'}) > 0$, then the unregularized $d_{2S}$ is not a Hilbertian metric.

ii) If $d$ is the Euclidean distance in a Euclidean space $X$ and $T_x$ are parallel affine
subspaces of $X$ then the unregularized $d_{2S}$ is a Hilbertian metric.

Proof. For i) we note that $d(T_x, T_{x'}) = \inf_{t,t' \in T} d(t(x), t'(x')) > 0$. So $d_{2S}$ is not
totally invariant with respect to $\bar{T}$ and not a Hilbertian metric due to Prop. 3. For statement
ii) we can define the orthogonal projection $\Phi : X \to \mathcal{H} := (T_O)^{\perp}$ on the orthogonal
complement of the linear subspace $T_O$ through the origin $O$, which implies that
$d_{2S}(x,x') = d(\Phi(x), \Phi(x'))$ and all sets $T_x$ are projected to a single point $\Phi(x)$ in
$(T_O)^{\perp}$. So $d_{2S}$ is a Hilbertian metric.

In particular, these findings allow us to state that the kernels on the left of Fig. 1 are
not pd as they are not totally invariant wrt. $\bar{T}$. On the contrary, the extension of the
upper right plot yields a pd kernel, as soon as $T_x$ are complete affine subspaces. So
these criteria can practically decide about the pd-ness of IDS-kernels.

If IDS-kernels are involved in learning algorithms, one should be aware of the
possible indefiniteness, though it is frequently no relevant disadvantage in practice.
Kernel principal component analysis can work with indefinite kernels, the SVM is
known to tolerate indefinite kernels and further kernel methods are being developed that
accept such kernels. Even if an IDS-kernel can be proven by the preceding to be
non-(c)pd in general, for various kernel parameter choices or a given dataset, the
resulting kernel matrix can occasionally still be (c)pd.

Fig. 2. Illustration of non-invariant (upper row) versus invariant (lower row) kernel
methods. a) Kernel k-nn classification with scale-invariance, b) kernel perceptron with
$k^{pol}$ of degree 2 and y-axis reflection-invariance, c) one-class-classification with $k^{lin}$ and
sine-invariance, d) SVM with rotation invariance.



5 Classification experiments 

For demonstration of the practical applicability in kernel methods, we condense the
results on classification with IDS-kernels from (Haasdonk and Burkhardt (2007)) in
Fig. 2. That study also gives summaries of real-world applications in the fields of
optical character recognition and bacteria-recognition.

A simple kernel method is the kernel nearest-neighbour algorithm for
classification. Fig. 2 a) is the result of the kernel 1-nearest-neighbour algorithm with the
non-invariant kernel and its scale-invariant IDS-kernel, where the scaling sets $T_x$ are
indicated with black lines. The invariance properties of the kernel function obviously
transfer to the analysis method by IDS-kernels.
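A kernel nearest-neighbour rule only needs kernel evaluations, because the squared distance in feature space is $k(x,x) - 2k(x,x') + k(x',x')$. The Python sketch below shows this with a plain rbf kernel; the function names and the toy data are assumptions for illustration, and the experiment reported here would pass the scale-invariant IDS-kernel as the `kernel` argument instead.

```python
import math


def k_rbf(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_nn_predict(x, train_points, train_labels, kernel):
    """1-nearest-neighbour in the feature space induced by `kernel`."""

    def sq_dist(a, b):
        # Squared feature-space distance expressed through kernel values only.
        return kernel(a, a) - 2.0 * kernel(a, b) + kernel(b, b)

    nearest = min(range(len(train_points)), key=lambda i: sq_dist(x, train_points[i]))
    return train_labels[nearest]


train_points = [(0.0, 0.0), (2.0, 2.0)]  # toy data, not from the paper
train_labels = ["a", "b"]
print(kernel_nn_predict((0.2, 0.1), train_points, train_labels, k_rbf))
print(kernel_nn_predict((1.9, 2.5), train_points, train_labels, k_rbf))
```

Because the classifier touches the data only through `kernel`, any invariance of the kernel carries over to the decision regions.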

Another aspect of interest is the convergence speed of online-learning algorithms
exemplified by the kernel perceptron. We choose two random point sets of 20 points
each lying uniformly distributed within two horizontal rectangular stripes indicated
in Fig. 2 b). We incorporate the y-axis reflection invariance. By a random data
drawing repeated 20 times, the non-invariant polynomial kernel $k^{pol}$ of degree 2 results in
21.00 ± 6.59 update steps, while the invariant kernel converges much faster after
11.55 ± 4.54 updates. So the explicit invariance knowledge leads to improved
convergence properties.
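The quantity reported above, the number of update steps until convergence, can be counted with a dual (kernel) perceptron as in the following Python sketch. It is a minimal illustration, not the experimental code: the data are four hand-picked points rather than the random stripes of Fig. 2 b), the kernel is the non-invariant polynomial kernel of degree 2, and `max_epochs` is an assumed safeguard.

```python
def k_pol(x, y, p=2, gamma=1.0):
    return (1.0 + gamma * sum(a * b for a, b in zip(x, y))) ** p


def kernel_perceptron(points, labels, kernel, max_epochs=100):
    """Dual perceptron for labels in {-1, +1}; returns (alpha, update count)."""
    alpha = [0] * len(points)
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(points, labels)):
            score = sum(
                a * yj * kernel(xj, x)
                for a, xj, yj in zip(alpha, points, labels)
                if a
            )
            if y * score <= 0:  # misclassified or on the boundary
                alpha[i] += 1
                updates += 1
                mistakes += 1
        if mistakes == 0:  # one full pass without errors: converged
            break
    return alpha, updates


points = [(0.0, 1.0), (1.0, 1.2), (0.0, -1.0), (1.0, -1.1)]  # toy data
labels = [1, 1, -1, -1]
alpha, updates = kernel_perceptron(points, labels, k_pol)
print(alpha, updates)
```

Replacing `k_pol` by a reflection-invariant IDS-kernel and averaging `updates` over repeated random draws would mirror the comparison described in the text.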

An unsupervised method for novelty detection is the optimal enclosing
hypersphere algorithm (Shawe-Taylor and Cristianini (2004)). As illustrated in Fig. 2 c)
we choose 30 points randomly lying on a sine-curve, which are interpreted as
normal observations. We randomly add 10 points on slightly downward/upward shifted
curves and want these points to be detected as novelties. The linear non-invariant $k^{lin}$
results in an ordinary sphere, which however gives an average of 4.75 ± 1.12 false
alarms, i.e. normal patterns detected as novelties, and 4.35 ± 0.93 missed outliers, i.e.
outliers detected as normal patterns. As soon as we involve the sine-invariance by the
IDS-kernel we consistently obtain 0.00 ± 0.00 false alarms and 0.40 ± 0.50 misses.
So explicit invariance gives a remarkable performance gain in terms of recognition
or detection accuracy.

We conclude the 2D experiments with the SVM on two random sets of 20 points
distributed uniformly on two concentric rings, cf. Fig. 2 d). We involve rotation
invariance explicitly by taking $T$ as rotations by angles $\phi \in [-\pi/2, \pi/2]$. In the example
we obtain an average of 16.40 ± 1.67 SVs (indicated as black points) for the
non-invariant case, whereas the IDS-kernel only returns 3.40 ± 0.75 SVs. So there
is a clear improvement by involving invariance, expressed in the model size. This is
a determining factor for the required storage, number of test-kernel evaluations and
error estimates.



6 Conclusion 

We investigated and formalized elementary properties of IDS-kernels. We have 
proven that IDS-kernels offer two intuitive ways of adjusting the total invariance 
to approximate invariance until recovering the non-invariant case for various dis- 
crete, continuous, infinite and even non-group transformations. By this they build a 
framework interpolating between invariant and non-invariant machine learning. The 
definiteness of the kernels can be characterized precisely, which gives practical cri- 
teria for checking positive definiteness in applications. 

The experiments demonstrate various benefits. In addition to the model-inherent
invariance, when applying such kernels, further advantages can be the convergence
speed in online-learning methods, model size reduction in SV approaches, or
improvement of prediction accuracy. We conclude that these kernels indeed can be
valuable tools for general pattern recognition problems with known invariances.


Abstract. The problem of selection of variables seems to be the key issue in classification of
multi-dimensional objects. An optimal set of features should be made of only those variables
which are essential for the differentiation of studied objects. This selection may be made easier
if a graphic analysis of a U-matrix is carried out. It makes it easy to identify variables which
do not differentiate the studied objects. A graphic analysis may, however, not suffice to analyse
data when an object is described with hundreds of variables. The authors of the paper propose
a procedure which allows eliminating variables with the smallest discriminating potential
based on the measurement of concentration of objects on the Kohonen self organising map
networks.



1 Introduction 

An intensive development of computer technologies in recent years has led, among
other things, to an enormous increase in the size of available databases. This concerns
not only an increase in the number of recorded cases. An essential, qualitative change is the
increase of the number of variables describing a particular case. There are databases
where one object is described by over 2000 attributes. Such a great number of
variables significantly changes the scale of problems connected with the analysis of
such databases. It results, inter alia, in problems with separating the group structure
of studied objects. According to Milligan (1994, 1996, p. 348), among others, the approach
frequently applied by the creators of databases, who strive to describe the objects with
the largest possible number of variables, is not only unnecessary but essentially
erroneous. Adding several irrelevant variables to the set of studied variables may limit or
even eliminate the possibility of discovering the group structure of studied objects.
Only such variables should be included in the set of variables which (cf. Gordon
1999, p. 3) contribute to:

• an increase in the homogeneity of separate clusters,

• an increase in the heterogeneity among clusters,

• easier interpretation of features of clusters which were set apart.

The reduction of the space of variables would also contribute to a considerable
reduction of the time of analyses and make it possible to apply much more refined, but at
the same time more sophisticated and time consuming, methods of data analysis.

The problem of reduction of the set of variables is extremely important in
solving classification problems. That is why considerable attention has been devoted
to it in the literature (cf. Gnanadesikan, Kettenring, Tsao, 1995). It is possible to
distinguish three approaches to the development of an optimal set of variables:

1. weighting the variables - where each variable is given a weight which is related
to its relative importance in the description of the studied problem,

2. selection of variables - consisting in the elimination of variables with the
smallest discriminating potential from the set of variables; this approach may be
considered as a special case of the first approach where rejected variables are assigned
the weight of 0 and selected variables the weight of 1,

3. replacement of the original variables with artificial variables - this is a classical
statistical approach based on the analysis of principal components.
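The relation between the first two approaches can be checked directly: a weighted distance with weights restricted to 0 and 1 equals the plain distance computed on the selected variables only. The Python sketch below illustrates this; the weighted Euclidean distance and the helper names are assumptions chosen for the example, since the text does not fix a particular distance.

```python
import math


def weighted_distance(x, y, weights):
    """Euclidean distance with one nonnegative weight per variable."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def select(x, keep):
    """Keep only the variables flagged with 1 (variable selection)."""
    return [a for a, flag in zip(x, keep) if flag]


x, y = (1.0, 5.0, 2.0), (4.0, -3.0, 6.0)
keep = (1, 0, 1)  # reject the second variable

by_weighting = weighted_distance(x, y, keep)
by_selection = weighted_distance(select(x, keep), select(y, keep), (1, 1))
print(by_weighting, by_selection)
```

Both values coincide, which is why selection can be treated as weighting with a 0/1 weight vector.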

In the present paper a method of selecting variables based on the neural SOM net- 
work belonging to the second of the above types of methods will be presented. 



2 A proposition to reduce the number of variables 

The Kohonen SOM network is a very attractive method of classifying
multidimensional data. As shown by Deboeck G. and Kohonen T. (1998) it is an efficient method
of sorting out complex data. It is also an excellent method of visualisation of
multidimensional data; examples supporting this claim may be found in Vesanto J.
(1997). One of the important properties of the SOM network is the possibility of
visualising the shares of particular variables in the matrix of unified distances (a U-matrix).
The joint activation of particular neurons of the network is the sum of the activations
resulting from particular variables. Since those components may be recorded
in a separate data vector, they may be analysed independently from one another.
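The additive split of the U-matrix into per-variable shares can be sketched as follows. This Python example is an illustration under assumptions that differ from the setup described in the text: the neurons lie on a rectangular grid with 4-neighbourhoods instead of a hexagonal one, squared differences are used so that the shares add up exactly to the total, and the function names and the tiny codebook are hypothetical.

```python
def umatrix_shares(codebook):
    """Per-variable shares of the U-matrix.

    codebook[r][c] is the weight vector of the neuron at grid position (r, c).
    Returns shares[v][r][c]: mean squared difference in variable v between the
    neuron and its grid neighbours.
    """
    rows, cols, dim = len(codebook), len(codebook[0]), len(codebook[0][0])
    shares = [[[0.0] * cols for _ in range(rows)] for _ in range(dim)]
    for r in range(rows):
        for c in range(cols):
            nbrs = [
                (r + dr, c + dc)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
            for v in range(dim):
                total = sum(
                    (codebook[r][c][v] - codebook[i][j][v]) ** 2 for i, j in nbrs
                )
                shares[v][r][c] = total / max(len(nbrs), 1)
    return shares


def umatrix(shares):
    """Total U-matrix as the sum of the per-variable shares."""
    rows, cols = len(shares[0]), len(shares[0][0])
    return [
        [sum(plane[r][c] for plane in shares) for c in range(cols)]
        for r in range(rows)
    ]


# One row of three neurons: variable 0 jumps between the last two neurons,
# variable 1 between the first two.
codebook = [[(0.0, 0.0), (0.0, 1.0), (5.0, 1.0)]]
shares = umatrix_shares(codebook)
print(shares[0], shares[1], umatrix(shares))
```

A variable whose share plane shows a pronounced "wall" separates the neurons, whereas a flat plane indicates little discriminating potential.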

Let us consider two simple examples. Figure 1 shows a set of 200 objects
described with 2 variables. It is possible to identify a clear structure of 4 clusters, each
made of 50 objects. The combination of both variables clearly differentiates the
clusters.

A SOM network was built for the above dataset with a hexagonal structure, with
a dimension of 17x17 neurons and a Gaussian neighbourhood function. The visualisation
of the matrix of unified distances (the U-matrix) is shown in Fig. 2. The colour of
particular segments indicates the distance at which a given neuron is located in
relation to its neighbours. Since some neurons identify the studied objects, this colour
shows at the same time the distances between objects in the space of features. The
"wall" of higher distances is clearly visible. Large distances separate objects which
create clear clusters (concentrations). The share of both variables in the matrix of
unified distances (U-matrix) is presented in Fig. 3. It can be clearly observed that
variables 1 and 2 separate the set of objects, each variable dividing the set into two
parts. Both parts of the graph indicate extreme distances between objects located
there. This observation allows us to say that both variables are characterised by a
similar potential of discrimination of the studied objects. Since the boundary between
both parts is so "acute", it may be considered that both variables have a considerable
potential to discriminate the studied objects.

Fig. 1. An exemplary dataset - set 1



Fig. 2. The matrix of unified distances for the dataset 1

Fig. 3. The share of variable 1 and 2 in the matrix of unified distances (U-matrix) - dataset 1



The situation is different in the second case. As in the former case, we observe
200 objects described by two variables and belonging to 4 clusters. The first
variable makes it easy to classify the objects into 4 clusters. Variable 2, however,
does not have such potential, since the clusters are non-separable with respect to
it. Fig. 4 presents the objects, while Fig. 5 shows the share of particular variables
in the matrix of unified distances (the U-matrix) based on the SOM network.

The analysis of the distances between objects with the use of the two selected
variables suggests that variable 1 discriminates the objects very well. The borders
between clusters are clear and easily discernible. It may be said that variable 1
has a great discriminating potential. Variable 2, however, has much worse
properties. It is not possible to identify clear clusters. Objects are rather
uniformly distributed over the SOM network. We can say that variable 2 has no
discriminating potential.

The application of the above procedure to assess the discriminating potential of 
variables is also highly efficient in more complicated cases and may be successfully 
applied in practice. 

Its essential weakness is that for a large number of variables it becomes time
consuming and inefficient. One way to circumvent that weakness, if the number of
variables does not exceed several hundred, is to apply a preliminary grouping of
variables. In socio-economic research there are very often many variables which
are correlated with one another in different ways and to a different extent. If we
first distinguish clusters of variables with similar properties, it becomes
possible to eliminate the variables with the smallest discriminating potential
from each cluster of variables. Each cluster of variables is analysed
independently, which makes the analysis easier. An exceptionally efficient method
of classifying variables is the SOM network with a chain topology. Fig. 6 shows
the SOM network which classifies 58 economic and social variables describing 307
Polish poviats (the smallest territorial administration units in Poland) in 2004.

Within particular clusters the number of variables is much smaller than in the
entire dataset, and it is much easier to eliminate those variables with the
smallest discriminating potential. At the same time this procedure does not allow
all variables with similar properties to be eliminated, because they are located
in one, non-empty cluster. Quite frequently, for factual reasons, we would like to
retain some variables, or prefer to retain at least one variable for each cluster.

Fig. 4. An exemplary dataset - set no. 2

Fig. 5. The share of variable 1 and 2 in a matrix of unified distances - dataset 2

For a great number of variables, above 100, a solely graphic analysis of discrim- 
inating potential of variables would be inefficient. Thus it seems justified to look for 
an analytical method of assessment of the discriminating potential of variables based 
on the SOM network and the above observations. 

Fig. 6. The share of variable 1 and 2 in a matrix of unified distances (a U-matrix)

One possible solution results from observing the location of objects on the map
of unified distances for the variables. It can be observed that variables with a
great discriminating potential are characterised by a higher concentration of
objects on the map than variables with a small potential. Variables with a small
discriminating potential are, to a large extent, rather uniformly located on the
map. On the basis of this observation we propose to apply concentration indices on
the SOM map in the assessment of the discriminating potential of variables. In the
presented study we tested two known concentration indices. The first one is the
concentration index based on entropy:



$$K_e = 1 - \frac{H}{\log_2(n)} \qquad (1)$$

where:

$$H = \sum_{i=1}^{n} p_i \log_2\left(\frac{1}{p_i}\right) \qquad (2)$$

The second of the proposed indices is the classical Gini concentration index:



i= 1 i=2 

Both indices are written in the form appropriate for individual data. Higher
values of these coefficients should indicate variables with a greater
discriminating potential.
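The two indices can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not state: p_i is taken to be the share of objects mapped to the i-th SOM neuron and n the number of neurons, and the Gini index is computed with the common textbook formula for individual data, which may differ in detail from the form used by the authors. The function names and the example counts are hypothetical.

```python
import math


def entropy_concentration(counts):
    """K_e = 1 - H / log2(n), with H = sum_i p_i * log2(1 / p_i).

    `counts` holds the number of objects per map unit (assumed meaning of p_i).
    """
    n = len(counts)
    total = sum(counts)
    shares = [c / total for c in counts]
    # Units with p_i = 0 are skipped: p * log2(1 / p) tends to 0 as p -> 0.
    h = sum(p * math.log2(1.0 / p) for p in shares if p > 0)
    return 1.0 - h / math.log2(n)


def gini_concentration(counts):
    """Textbook Gini coefficient for individual data (assumed form)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(values, start=1))
    return weighted / (n * total)


# Hypothetical hit counts on a four-unit map.
uniform = [5, 5, 5, 5]        # objects spread evenly: no concentration
concentrated = [20, 0, 0, 0]  # all objects on one unit: maximal concentration

print(entropy_concentration(uniform), gini_concentration(uniform))
print(entropy_concentration(concentrated), gini_concentration(concentrated))
```

Under these assumptions both indices grow as objects pile up on fewer map units, which is the behaviour the proposal relies on.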



3 Applications and results 

Applying the proposed indices to the first example yields the values recorded in
Table 1 (the SOM network is the same as in Fig. 2).

The discriminating potential was initially assessed as high for both variables.
The values of the concentration coefficients for both variables were also similar.¹

¹ It is worth noting that the absolute value of the coefficients is of no relevance
here. The differences between the values for particular variables are more important.
Table 1. Values of concentration coefficients for set 1.

Variable    Ke        Gini
1           0.0412    0.3612
2           0.0381    0.3438



The values of the indices for the variables from the second example are given in
Table 2 (the SOM network is the same as in Fig. 5). As can be observed, the second
variable is characterised by much smaller values of the concentration coefficients
than the first variable.



Table 2. Values of concentration coefficients for set 2.

Variable    Ke        Gini
1           0.0411    0.3568
2           0.0145    0.2264



This is consistent with the observations based on graphic analysis, since the
discriminating potential of the first variable was assessed as high, while the
potential of the second variable was assessed as low. The procedure of eliminating
variables with a low discriminating potential may be combined with a procedure of
classification of variables. This prevents a situation in which all variables of a
given type are eliminated because they are located in a single cluster of
variables. Such a property is desirable in many cases. The full procedure of
elimination of variables is presented in Fig. 7. It consists of several stages. In
the first stage the SOM network is built on the basis of all variables, and the
values of the concentration coefficients are determined. In the second stage the
variables are classified using a SOM network with a chain topology, and the
variables with the smallest value of the concentration coefficient are eliminated
from each cluster of variables. In the third stage a new SOM network is built for
the reduced set of variables. In order to assess whether the elimination of
particular variables improves the resulting group structure, the value of an index
of classification quality should be determined. Among the better known ones are
the Calinski-Harabasz, Davies-Bouldin² and Silhouette³ indices. In the quoted
research the value of the Silhouette index was determined. Apart from properties
that allow for a good assessment of the group structure of objects, this index
makes it possible to visualise the membership of objects in particular clusters,
which is compatible with the idea of studies based on graphic analysis proposed
here. The procedure is repeated as long as the number of variables in a cluster of
variables is not smaller than a number determined in advance and the value of the
Silhouette index increases. The application of the above procedure (compare
Fig. 7) to the determination of an optimal set of variables describing Polish
poviats is presented in Table 3. In the presented analysis the reduction of
variables was based on the Ke concentration coefficient, since it differentiated
particular variables several times more strongly than the Gini coefficient. The
Silhouette index for the classification of poviats on the basis of all variables
equals -0.07. This suggests that the group structure is entirely spurious.
Elimination of variable no. 24⁴ clearly improves the group structure. In the
subsequent iterations further variables are systematically eliminated, increasing
the value of the Silhouette index.

² Compare: Milligan G.W., Cooper M.C. (1985), An examination of procedures for
determining the number of clusters in a data set. Psychometrika, 50(2), p. 159-179.

³ Rousseeuw P.J. (1987), Silhouettes: a graphical aid to the interpretation and
validation of cluster analysis. J. Comput. Appl. Math. 20, p. 53-65.

After six iterations the highest value of the Silhouette index is achieved, and
the elimination of further variables does not improve the resulting cluster
structure. The cluster structure obtained after the removal of 14 variables is not
very strong, but it is meaningfully better than the one resulting from the
consideration of all variables. The resulting classification of poviats is
factually justified, so the clusters can be interpreted well.⁵
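The elimination stage of the procedure can be sketched as follows. This is only an illustration under stated assumptions: the cluster assignment of variables (here a plain dictionary) would in practice come from the chain-topology SOM, the concentration values from the K_e index, and the function name, variable names and numbers are all hypothetical. Rebuilding the SOM and checking the Silhouette index after each step are left out.

```python
def eliminate_weakest(clusters, concentration, min_size):
    """Drop the variable with the smallest concentration value from every
    cluster of variables that still has more than `min_size` members."""
    removed = []
    reduced = {}
    for label, variables in clusters.items():
        keep = list(variables)
        if len(keep) > min_size:
            weakest = min(keep, key=lambda v: concentration[v])
            keep.remove(weakest)
            removed.append(weakest)
        reduced[label] = keep
    return reduced, removed


# Hypothetical clusters of variables and hypothetical K_e values.
clusters = {"A": ["v1", "v2", "v3"], "B": ["v4", "v5"]}
concentration = {"v1": 0.04, "v2": 0.01, "v3": 0.03, "v4": 0.02, "v5": 0.05}

reduced, removed = eliminate_weakest(clusters, concentration, min_size=2)
print(removed)  # variables dropped in this step
print(reduced)  # clusters that remain
```

In a full loop this step would be repeated, keeping a reduction only while the Silhouette index of the resulting classification keeps increasing.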



Table 3. Values of the Silhouette index after the reduction of variables

Step    Removed Variables    Global Silhouette Index
0       all var.             -0.07
1       24                    0.10
2       36                    0.11
3       18, 43                0.11
4       1, 2, 3, 6            0.13
5       3, 15, 26, 39         0.28
6       4, 17                 0.39
7       5, 20, 23             0.38



4 Conclusions 

The proposed method of variable selection has numerous advantages. It is a fully
automatic procedure, compatible with the Data Mining philosophy of analysis. The
substantial empirical experience of the authors suggests that it leads to a
considerable improvement in the obtained group structure in comparison with the
analysis of the whole data set. The greater the number of variables studied, the
more efficient it is.

⁴ After each iteration the variables are renumbered, which is why the same variable
numbers may appear in subsequent iterations.

⁵ Compare: Migdal Najman K., Najman K. (2003), Zastosowanie sieci neuronowej typu
SOM w badaniu przestrzennego zroznicowania powiatow (Application of the SOM neural
network in studies of spatial differentiation of poviats), Wiadomosci Statystyczne,
4/2003, p. 72-85.

Fig. 7. Procedure of determination of an optimal set of variables



This procedure may also be applied as a preprocessor together with other methods
of data classification. Measures of discriminating potential other than the
concentration coefficients may be used as well, for example measures based on the
distance between objects on the SOM map.

The proposed method is, however, not devoid of flaws. Its application must be
preceded by a subjective determination of the minimum number of variables in a
single cluster of variables. There are no factual indications of how great that
number should be. The method is also very sensitive to the quality of the SOM
network itself⁶. Since the learning algorithm of the SOM network is not
deterministic, in subsequent iterations it is possible to obtain a network with
very weak discriminating properties. In such a situation the value of the
Silhouette index in subsequent stages of variable reduction may not be monotone,
which would make the interpretation of the obtained results substantially more
difficult. Finally, it is worth noting that for large databases the repeated
construction of SOM networks may be time consuming and may require considerable
computing capacity.

In the opinion of the authors the presented method has proved its utility in
numerous empirical studies and may be successfully applied in practice.

⁶ The quality of the SOM network is assessed on the basis of the following
coefficients: topographic, distortion and quantisation.




Computer Assisted Classification of Brain Tumors

Norbert Röhrl¹, Jose R. Iglesias-Rozas² and Galia Weidl¹

¹ Institut für Analysis, Dynamik und Modellierung, Universität Stuttgart
Pfaffenwaldring 57, 70569 Stuttgart, Germany
roehrl@iadm.uni-stuttgart.de
² Katharinenhospital, Institut für Pathologie, Neuropathologie
Kriegsbergstr. 60, 70174 Stuttgart, Germany
jr.iglesias@katharinenhospital.de

Abstract. The histological grade of a brain tumor is an important indicator for choosing the 
treatment after resection. To facilitate objectivity and reproducibility, Iglesias et al. (1986) 
proposed to use a standardized protocol of 50 histological features in the grading process. 

We tested the ability of Support Vector Machines (SVM), Learning Vector Quantization 
(LVQ) and Supervised Relevance Neural Gas (SRNG) to predict the correct grades of the 
794 astrocytomas in our database. Furthermore, we discuss the stability of the procedure with 
respect to errors and propose a different parametrization of the metric in the SRNG algorithm 
to avoid the introduction of unnecessary boundaries in the parameter space. 



1 Introduction 

Although the histological grade has been recognized as one of the most powerful 
predictors of the biological behavior of tumors and significantly affects the manage- 
ment of patients, it suffers from low inter- and intraobserver reproducibility due to 
the subjectivity inherent to visual observation. The common procedure for grading 
is that a pathologist looks at the biopsy under a microscope and then classifies the 
tumor on a scale of 4 grades from I to IV (see Fig. 1). The grades roughly correspond 
to survival times: a patient with a grade I tumor can survive 10 or more years, while
a patient with a grade IV tumor dies with high probability within 15 months. Iglesias
et al. (1986) proposed to additionally use a standardized protocol of 50 histological
features to make the grading of tumors reproducible and to provide data for statistical
analysis and classification.

The presence of these 50 histological features (Fig. 2) was rated in 4 categories
from 0 (not present) to 3 (abundant) by visual inspection of the sections under a
microscope. The type of astrocytoma was then determined by an expert and the
corresponding histological grade between I and IV was assigned.
Fig. 1. Pictures of biopsies under a microscope. The larger picture is healthy brain tissue
with visible neurons. The small pictures are tumors of increasing grade from left top to right
bottom. Note the increasing number of cell nuclei and increasing disorder.

Fig. 2. One of the 50 histological features: Concentric arrangement. The tumor cells build
concentric formations with different diameters.



2 Algorithms 

We chose LVQ (Kohonen (1995)), SRNG (Villmann et al. (2002)) and SVM (Vapnik
(1995)) to classify this high dimensional data set, because the generalization
error (expectation value of misclassification) of these algorithms does not depend
on the dimension of the feature space (Bartlett and Mendelson (2002), Crammer et al.
(2003), Hammer et al. (2005)).

For the computations we used the original LVQ-PAK (Kohonen et al. (1992)),
LIBSVM (Chang and Lin (2001)) and our own implementation of SRNG, since to our
knowledge no freely available package exists. Moreover, to obtain our best
results, we had to deviate in some respects from the description given in the
original article (Villmann et al. (2002)). In order to be able to discuss our
modification, we briefly formulate the original algorithm.

2.1 SRNG 

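Let the feature space be $\mathbb{R}^n$ and fix a discrete set of labels $Y$, a training set
$T \subseteq \mathbb{R}^n \times Y$ and a prototype set $C \subseteq \mathbb{R}^n \times Y$.

The distance in feature space is defined to be

$$d_\lambda(x,\tilde{x}) = \sum_{i=1}^{n} \lambda_i\,|x_i - \tilde{x}_i|^2,$$

with parameters $\lambda = (\lambda_1, \dots, \lambda_n) \in \mathbb{R}^n$, $\lambda_i \ge 0$ and
$\sum_i \lambda_i = 1$. Given a sample $(x,y) \in T$, we denote its distance to the closest
prototype with a different label by $d_\lambda^-(x,y)$,

$$d_\lambda^-(x,y) := \min\{\, d_\lambda(x,\tilde{x}) \mid (\tilde{x},\tilde{y}) \in C,\ \tilde{y} \ne y \,\}.$$

We denote the set of all prototypes with label $y$ by

$$W_y := \{(\tilde{x}, y) \in C\}$$

and enumerate its elements $(\tilde{x}, y)$ according to their distance to $(x,y)$:

$$\operatorname{rg}_{(x,y)}(\tilde{x}, y) := \left|\{(\hat{x}, y) \in W_y \mid d_\lambda(\hat{x}, x) < d_\lambda(\tilde{x}, x)\}\right|.$$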

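Then the loss of a single sample $(x,y) \in T$ is given by

$$L_{C,\lambda}(x,y) := \frac{1}{c} \sum_{(\tilde{x},y) \in W_y} \exp\left(-\gamma^{-1}\operatorname{rg}_{(x,y)}(\tilde{x},y)\right) \operatorname{sgd}\left(\frac{d_\lambda(x,\tilde{x}) - d_\lambda^-(x,y)}{d_\lambda(x,\tilde{x}) + d_\lambda^-(x,y)}\right),$$

where $\gamma$ is the neighborhood range, $\operatorname{sgd}(x) = (1 + \exp(-x))^{-1}$ the sigmoid
function and

$$c = \sum_{n=0}^{|W_y|-1} e^{-\gamma^{-1} n}$$

a normalization constant. The actual SRNG algorithm now minimizes the total loss
of the training set $T$,

$$L_{C,\lambda}(T) = \sum_{(x,y) \in T} L_{C,\lambda}(x,y), \qquad (1)$$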

by stochastic gradient descent with respect to the prototypes $C$ and the parameters of
the metric $\lambda$, while letting the neighborhood range $\gamma$ approach zero. This in
particular reduces the dependence on the initial choice of the prototypes, which is a
common problem with LVQ.

Stochastic gradient descent here means that we compute the gradients $\nabla_C L$ and
$\nabla_\lambda L$ of the loss function $L_{C,\lambda}(x,y)$ of a single randomly chosen element
$(x,y)$ of the training set and replace $C$ by $C - \varepsilon_C \nabla_C L$ and $\lambda$ by
$\lambda - \varepsilon_\lambda \nabla_\lambda L$ with small learning rates
$\varepsilon_C > 10\,\varepsilon_\lambda > 0$. The different magnitude of the learning rates is
important, because classification is primarily done using the prototypes. If the metric is
allowed to change too quickly, the algorithm will in most cases end in a suboptimal minimum.







2.2 Modified SRNG 

In our early experiments and while tuning SRNG for our task, we found two
problems with the distance used in feature space.

The straightforward parametrization of the metric comes at the price of
introducing the boundaries $\lambda_i \ge 0$, which in practice are often hit too early and
knock out the corresponding feature. Also, artificially setting negative $\lambda_i$ to zero
slows down the convergence process.

The other point is that by choosing different learning rates $\varepsilon_C$ and
$\varepsilon_\lambda$ for prototypes and metric parameters, we are no longer using the gradient of
the given loss function (1), which can also be problematic in the convergence process.

We propose using the following metric for measuring distance in feature space

$$d_\lambda(x,\tilde{x}) = \sum_{i=1}^{n} e^{r\lambda_i}\,|x_i - \tilde{x}_i|^2,$$

where the dependence on $\lambda_i$ is exponential and we introduce a scaling factor $r > 0$.
This definition avoids explicit boundaries for $\lambda_i$, and $r$ makes it possible to adjust
the rate of change of the distance function relative to the prototypes. Hence this
parametrization enables us to minimize the loss function by stochastic gradient descent
without treating prototypes and metric parameters separately.
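The difference between the two parametrizations can be illustrated with a small Python sketch. It assumes squared coordinate differences in both metrics, and the function names and sample values are hypothetical; it is not the authors' implementation.

```python
import math


def weighted_distance(x, proto, lam):
    """Original form: sum_i lam_i * (x_i - proto_i)^2.

    Only a valid distance while every lam_i stays non-negative.
    """
    return sum(l * (a - b) ** 2 for l, a, b in zip(lam, x, proto))


def exp_distance(x, proto, lam, r):
    """Modified form: sum_i exp(r * lam_i) * (x_i - proto_i)^2.

    Every weight exp(r * lam_i) is positive for any real lam_i, so a
    gradient step can never push a feature weight below zero.
    """
    return sum(math.exp(r * l) * (a - b) ** 2 for l, a, b in zip(lam, x, proto))


x, proto = [1.0, 2.0], [0.0, 0.0]
lam_after_step = [-5.0, 0.0]  # hypothetical parameters after a large update

print(weighted_distance(x, proto, lam_after_step))       # negative: invalid
print(exp_distance(x, proto, lam_after_step, r=0.1))     # still positive
```

With the exponential form no clipping of the metric parameters is needed, and the scaling factor r controls how strongly a change in a parameter alters the distance.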



3 Results 

To test the prediction performance of the algorithms (Table 1), we divided the 794
cases (grade I: 156, grade II: 362, grade III: 238, grade IV: 38) into 10 subsets of
equal size and grade distribution for cross validation.

For SVM we used an RBF kernel and let LIBSVM choose its two parameters.
LVQ performed best with 700 prototypes (which is roughly equal to the size of the
training set), a learning rate of 0.1 and 70000 iterations.

Choosing the right parameters for SRNG is a bit more complicated. After some
experiments using cross validation, we got the best results using 357 prototypes, a
learning rate of 0.01, a metric scaling factor $r = 0.1$ and a fixed neighborhood range
$\gamma = 1$. We stopped the iteration process once the classification results for the
training set got worse. An attempt to choose the parameters on a grid by cross validation
over the training set yielded a recognition rate of 77.47%, which is almost 2% below our
best result.

For practical applications, we also wanted to know how well the algorithms perform
in the presence of noise. If we prepare the testing set such that in 5% of the
features, uniformly over all cases, a feature is ranked one class higher or lower with
equal probability, we still get 76.6% correct predictions using SVM and 73.1% with
SRNG. At 10% noise, the performance drops to 74.3% (SVM) and 70.2% (SRNG).
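The noise model described above can be sketched in Python as follows. Clipping the perturbed rating to the range 0 to 3 is an assumption, since the text does not say how ratings at the ends of the scale were treated; the function name and the example case are hypothetical.

```python
import random


def perturb_ratings(case, noise_rate, rng):
    """Shift each feature rating by +1 or -1 with probability `noise_rate`."""
    noisy = []
    for rating in case:
        if rng.random() < noise_rate:
            rating += rng.choice((-1, 1))    # one class higher or lower
            rating = min(3, max(0, rating))  # assumption: stay on the 0..3 scale
        noisy.append(rating)
    return noisy


rng = random.Random(0)             # fixed seed for a reproducible illustration
case = [0, 1, 2, 3] * 12 + [1, 2]  # hypothetical case with 50 feature ratings
noisy_case = perturb_ratings(case, noise_rate=0.05, rng=rng)
print(sum(a != b for a, b in zip(case, noisy_case)), "ratings changed")
```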
Table 1. The classification results. The columns show how many cases of grade i were
classified as grade j. For example, in SRNG grade 1 tumors were classified as grade 3 in
2.26% of the cases.

Total    LVQ      SRNG     SVM
good     73.69    79.36    79.74



4 Conclusions 

We showed that the histological grade of the astrocytomas in our database can be 
reliably predicted with Support Vector Machines and Supervised Relevance Neural 
Gas from 50 histological features rated on a scale from 0 to 3 by a pathologist. Since 
the attained accuracy is well above the concordance rates of independent experts 
(Coons et al. (1997)), this is a first step towards objective and reproducible grading 
of brain tumors. 

Moreover we introduced a different distance function for SRNG, which in our 
case improved convergence and reliability. 
Abstract. Mixture regression models have increasingly received attention from both
marketing theory and practice, but the question of selecting the correct number of
segments is still without a satisfactory answer. Various authors have considered this
problem, but as most of the available studies appeared in the statistics literature,
they aim to exemplify the effectiveness of newly proposed measures instead of revealing
the performance of measures commonly available in statistical packages. The study
investigates how well commonly used information criteria perform in mixture regression
of normal data, with alternating sample sizes. In order to account for different levels
of heterogeneity, this factor was analyzed for different mixture proportions. As
existing studies only evaluate the criteria’s relative performance, the resulting
success rates were compared with an outside criterion, so-called chance models. The
findings prove helpful for specific constellations.



1 Introduction 

In the field of marketing, finite mixture models have recently received increasing 
attention from both a practical and theoretical point of view. In the last years, tradi- 
tional mixture models have been extended by various multivariate statistical methods 
such as multidimensional scaling, exploratory factor analysis (DeSarbo et al. (2001)) 
or structural equation models (Jedidi et al. (1979); Hahn et al. (2002)), whereas 
regression models (Wedel and Kamakura, (1999), p. 99) for normally distributed 
data are the most common analysis procedure in marketing context, e.g. in terms of 
conjoint and market response models (Andrews et al. (2002); Andrews and Currim 
(2003b), p. 316). Correspondingly, mixture regression models are prevalent in mar- 
keting literature. Despite their widespread use and the importance of retaining the 
true number of segments in order to reach meaningful conclusions from any anal- 
ysis, model selection is still an unresolved problem (Andrews and Currim (2003a), 
p. 235; Wedel and Kamakura (1999), p. 91). Choosing the wrong number of seg- 
ments results in an under- or oversegmentation, thus leading to flawed management 
decisions on e.g. customer targeting, product positioning or the determination of the 
optimal marketing mix (Andrews and Currim (2003a), p. 235). Therefore the objective
of this paper is to give recommendations on which criterion should be considered
at what combination of sample/segment size in order to identify the true number of
segments in a given data set.

Various authors have considered the problem of choosing the number of seg- 
ments in mixture models in different context. But as most of the available studies 
appeared in statistics literature, they aim at exemplifying the effectiveness of new 
proposed measures, instead of revealing the performance of measures commonly 
available in statistical packages. Despite its practical importance, this topic has not 
been thoroughly considered for mixture regression models. An exception in this area 
are the studies by Hawkins et al. (2001), Andrews and Currim (2003b) and Oliveira- 
Brochado and Martins (2006), that examine the performance of various information 
criteria against several factors such as measurement level of predictors, number of 
predictors, separation of the segments or error variance. Regardless of the broad 
scope of questions covered in these studies, they do not profoundly investigate the 
criteria’s performance against the one factor best influenceable by the marketing an- 
alyst, namely the sample size. From an application-oriented point of view, it is de- 
sirable to know which sample size is necessary in order to guarantee validity when 
choosing a model with a certain criterion. Furthermore, the sample size is a key
differentiator between different criteria, having a large effect on the criteria’s
effectiveness. Therefore, the first objective of this study is to determine how well
the information criteria perform in mixture regression of normal data with alternating
sample sizes. Another factor that is closely related to this problem concerns segment
size ratio, as past research suggests the mixture proportions to have a significant
effect on the criteria’s performance (Andrews and Currim (2003b)). Even though a
specific sample size might prove beneficial in order to guarantee a satisfactory
performance of the information criteria in general, the presence of niche segments might
lead to a reduced heterogeneity and thus to a wrong decision in choosing the number
of segments. That is why the second objective is to measure the information
criteria’s performance in order to be able to assess the validity of the criteria chosen
when specific segment and sample sizes are present. These factors are evaluated for
a three-segment solution by conducting a Monte Carlo simulation.



2 Model selection in mixture models 

Assessing the number of segments in a mixture model is a difficult but important
task. Whereas it is well known that conventional χ²-based goodness-of-fit tests and
likelihood ratio tests are unsuitable for making this determination (Aitkin and
Rubin (1985)), the question of which model selection statistic should be used
remains unsolved (McLachlan and Peel (2000)). Different test procedures designed to
circumvent implementation problems of classical χ²-tests exist, but have not yet
found their way into widely used software applications for mixture model
estimation (Sarstedt (2006), p. 8). Another main approach for deciding on the number of
segments is based on a penalized form of the likelihood. These so-called information
criteria simultaneously take into account the goodness-of-fit (likelihood) of a model
and the number of parameters used to achieve that fit.

The simulation study focuses on four of the most representative and widely
applied model selection criteria. In a recent study, Oliveira-Brochado and Martins
(2006) report that in 37 published studies, Akaike’s Information Criterion (AIC)
(Akaike, 1973) was used 15 times, the Consistent Akaike’s Information Criterion
(CAIC) (Bozdogan (1987)) was used 13 times and the Bayes Information Criterion
(BIC) (Schwarz (1978)) was used 11 times (multiple selections possible). In another
meta-study of all major marketing journals, Sarstedt (2006) observes that BIC, AIC,
CAIC and the Modified AIC with factor three (AIC3) (Bozdogan (1994)) are the
selection statistics most frequently used in mixture regression analysis. In none of
the studies examined by Sarstedt did the authors resort to statistical tests to
decide on the number of segments in the mixture. This report narrows its focus to
presenting the simulation results for AIC, BIC, CAIC and AIC3. Furthermore, the
Adjusted BIC (Rissanen, 1978) is considered because the authors expect an increased
usage due to its implementation in Mplus, an increasingly popular software package
for estimating mixture models. For a detailed discussion of the statistical
properties of the criteria, the reader is referred to the references cited above.
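As a reminder of how these penalized-likelihood criteria are usually computed, the following Python sketch uses the common textbook definitions. The text itself does not spell out the formulas, so the exact forms (in particular the sample-size adjustment (n + 2) / 24 for the Adjusted BIC) are assumptions, and the function name and input values are hypothetical.

```python
import math


def information_criteria(log_lik, n_params, n_obs):
    """Common textbook forms; the model with the smallest value is preferred."""
    dev = -2.0 * log_lik  # goodness-of-fit term
    return {
        "AIC": dev + 2 * n_params,
        "AIC3": dev + 3 * n_params,
        "BIC": dev + n_params * math.log(n_obs),
        "CAIC": dev + n_params * (math.log(n_obs) + 1),
        # Sample-size adjusted BIC (assumed form).
        "ABIC": dev + n_params * math.log((n_obs + 2) / 24),
    }


# Hypothetical fit: log-likelihood -100 with 5 parameters on 100 observations.
for name, value in information_criteria(-100.0, 5, 100).items():
    print(name, round(value, 2))
```

The criteria differ only in how heavily each additional parameter is penalized, which is why their behaviour diverges as the sample size changes.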



3 Simulation design 

The strategy for this simulation consists of initially drawing observations derived
from an ordinary least squares regression and applying these to the FlexMix
algorithm (Leisch, 2004; Grün and Leisch (2006)). FlexMix is a general framework for
finite mixtures of regression models using the EM algorithm (Dempster et al., 1977),
which is available as an extension package for the statistical computing software R.
In this simulation study, models with alternating observations and three continuous
predictors were considered for the OLS regression. First, $Y = \beta'X$ was computed for
each observation, where $X$ was drawn from a normal distribution. Subsequently an
error term derived from a standard normal distribution was added to the true values.
Each simulation set-up was run with 1,000 iterations. The main parameters
controlling the simulation were:

• The number of segments: $K = 3$

• The regression coefficients in each segment, which were specified as follows:

- Segment 1: $\beta_1 = (1, 1, 1.5, 2.5)'$

- Segment 2: $\beta_2 = (1, 2.5, 1.5, 4)'$

- Segment 3: $\beta_3 = (2, 4.5, 2.5, 4)'$

• Sample sizes, which were varied in steps of one hundred over the interval
[100; 1,000]. For each of the sample sizes the simulation was run for three types of
mixture proportions. To allow for a high level of heterogeneity, two small segments
and one large segment were generated.

- Minor proportions: $\pi_1 = \pi_2 = 0.1$ and $\pi_3 = 0.8$

- Intermediate proportions: $\pi_1 = \pi_2 = 0.2$ and $\pi_3 = 0.6$

- Near-uniform proportions: $\pi_1 = \pi_2 = 0.3$ and $\pi_3 = 0.4$

• Each simulation run was carried out five times for $K = 1, \dots, 5$ segments.

The likelihood was maximized using the EM algorithm. As a limitation of the
algorithm is its convergence to local maxima (Wedel and Kamakura (1999), p. 88),
it was run repeatedly with 10 replications, totalling 50 runs per iteration. For each
number of segments, the best solution was picked.
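The data-generating step of this design can be sketched in Python. Several details are assumptions rather than statements from the text: the first coefficient is treated as an intercept, the predictors are drawn from a standard normal distribution, and segment membership is drawn at random for each observation. Fitting the mixtures with FlexMix in R is not reproduced here, and all names are hypothetical.

```python
import random

# Segment-specific coefficients from the design (first entry assumed intercept).
BETAS = [(1, 1, 1.5, 2.5), (1, 2.5, 1.5, 4), (2, 4.5, 2.5, 4)]


def simulate(n_obs, proportions, rng):
    """Draw (segment, x, y) triples from a three-segment mixture regression."""
    data = []
    for _ in range(n_obs):
        seg = rng.choices(range(3), weights=proportions)[0]
        x = [1.0] + [rng.gauss(0, 1) for _ in range(3)]  # intercept + 3 predictors
        y = sum(b * v for b, v in zip(BETAS[seg], x)) + rng.gauss(0, 1)
        data.append((seg, x, y))
    return data


rng = random.Random(1)  # fixed seed for a reproducible illustration
sample = simulate(100, (0.1, 0.1, 0.8), rng)  # "minor proportions" setting
print(len(sample), "observations generated")
```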



4 Results summary 

The performance of each criterion was measured by its success rate, i.e. the
percentage of iterations in which the criterion succeeded in determining the true
number of segments in the model. As indicated above, previous studies only observe
the criteria’s relative performance, ignoring the question whether the criteria perform
any better than chance. To gain a deeper understanding of the criteria’s absolute
performance one has to compare the success rates with an ex-ante specified chance
model. In order to verify whether the criteria are adequate, the predictive accuracy of
each criterion with respect to chance is measured using the following chance models
derived from discriminant analysis (Morrison (1969)): random chance, proportional
chance and maximum chance criterion. In order to be able to apply these criteria,
the researcher has to have prior knowledge or make presumptions concerning the
underlying model: For a given data set, let $M_j$ be a model with $K_j$ segments from a
consideration set with $C$ competing models $\mathcal{K} = \{M_1, \ldots, M_C\}$ and let $p_j$ be the prior
probability to observe $M_j$ ($j = 1, \ldots, C$), with $\sum_{j=1}^{C} p_j = 1$. The random chance
criterion is $CM_{ran} = 1/C$, which indicates that each of the competing models has
an equal prior probability. The proportional chance criterion is $CM_{prop} = \sum_{j=1}^{C} p_j^2$,
which has been used mainly as a point of reference for subjective evaluation
(Morrison (1969)), rather than the basis of a statistical test to determine if the expected
proportion differs from the observed proportion of models that is correctly classified.
The maximum chance criterion is $CM_{max} = \max(p_1, \ldots, p_C)$, which defines the
maximum prior probability to observe model $j$ in a given consideration set as being the
benchmark for a criterion’s success rate. Since $CM_{ran} \le CM_{prop} \le CM_{max}$, $CM_{max}$
denotes the strictest of the three chance model criteria. If a criterion cannot do better
than $CM_{max}$, one might disregard the model selection statistics and choose $M_j$ where
$p_j = \max(p_1, \ldots, p_C)$. But as model selection criteria may defy the odds by pointing at a
model $i$ where $p_i < \max(p_1, \ldots, p_C)$, in most situations $CM_{prop}$ should be used.
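
The three chance benchmarks can be computed directly from a vector of prior
probabilities. The following is a minimal sketch, not part of the original study; the
function names `chance_criteria` and `is_adequate` are hypothetical, and the priors in
the usage note are the illustrative values used later in this section.

```python
def chance_criteria(priors):
    """Return (CM_ran, CM_prop, CM_max) for prior model probabilities p_1..p_C."""
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("prior probabilities must sum to 1")
    cm_ran = 1.0 / len(priors)            # every model equally likely
    cm_prop = sum(p * p for p in priors)  # sum of squared priors
    cm_max = max(priors)                  # always pick the most likely model
    return cm_ran, cm_prop, cm_max


def is_adequate(success_rate, benchmark):
    """A criterion is adequate if its success rate exceeds the chance benchmark."""
    return success_rate > benchmark


cm_ran, cm_prop, cm_max = chance_criteria([0.5, 0.3, 0.2])
```

With priors 0.5, 0.3 and 0.2 this gives roughly 0.33, 0.38 and 0.5, so a criterion with
a success rate of 0.45 would pass the random and proportional benchmarks but not the
maximum chance benchmark.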

Relating to the focus of this article, an information criterion is adequate for a
certain factor level combination when the success rate is greater than the value of a
given chance model criterion. If this is not the case, a researcher should rather revert
to practical considerations, for example segment identifiability, when choosing the
number of segments. To make use of the idea of chance models, one can define a
consideration set $\mathcal{K} = \{M_1, M_2, M_3\}$ where $M_1$ denotes a model with $K = 2$ segments
(underfitting), $M_2$ a model with $K = 3$ segments (success) and $M_3$ a model with $K \ge 4$
segments (overfitting), thus leading to the random chance criterion $CM_{ran} \approx 0.33$.
Suppose a researcher has the following prior probabilities to observe one of the
models: $p_1 = 0.5$, $p_2 = 0.3$ and $p_3 = 0.2$. Then the proportional chance criterion for each
factor level combination is $CM_{prop} = 0.38$ and the maximum chance criterion is
$CM_{max} = 0.5$. The following figures illustrate the findings of the simulation run. Line
charts are used to show the success rates for all sample/segment size combinations.
Vertical dotted lines illustrate the boundaries of the previously mentioned chance
models with $\mathcal{K} = \{M_1, M_2, M_3\}$: $CM_{ran} \approx 0.33$ (lower dotted line), $CM_{prop} = 0.38$
(medial dotted line) and $CM_{max} = 0.5$ (upper dotted line). These boundaries are just
exemplary and need to be specified by the researcher depending on the analysis
at hand.

Figure 1 illustrates the success rates of the five information criteria with respect
to minor mixture proportions. Whereas AIC demonstrates a poor performance
across all levels of sample size, CAIC outperforms the other criteria across almost all
factor levels. The criterion performs favorably in recovering the true number of
segments, meeting exemplary chance boundaries for sample sizes of approximately 150
(random chance, proportional chance) and 250 (maximum chance), respectively. The
results in Figure 2 from intermediate and near-uniform mixture proportions confirm
the previous findings and underline the CAIC’s strong performance in small sample
size situations, quickly achieving success rates of over 90%. However, as sample
sizes increase to 400, both ABIC and AIC3 perform advantageously. Even with
near-uniform mixture proportions, AIC fails to meet any of the chance boundaries used in
this set-up. In contrast to previous findings by Andrews and Currim (2003b), CAIC
outperforms BIC across almost all sample/segment size combinations, although the
difference is marginal in the minor mixture proportion case.

Fig. 1. Success rates with minor mixture proportions

Fig. 2. Success rates with intermediate and near-uniform mixture proportions (panels:
near-uniform proportions, intermediate proportions; legend: AIC, BIC, CAIC, ABIC, AIC3)



5 Key contributions and future research directions 

The findings presented in this paper are relevant to a large number of researchers 
building models using mixture regression analysis. This study extends previous stud- 
ies by evaluating how the interaction of sample and segment size affects the perfor- 
mance of five of the most widely used information criteria for assessing the true 
number of segments in mixture regression models. For the first time the quality of 
these criteria was evaluated for a wide spectrum of possible sample/segment-size 
constellations. AIC demonstrates an extremely poor performance across all simulation
situations. From an application-oriented point of view, this proves to be problematic,
taking into account the high percentage of studies relying on this criterion
to assess the number of segments in the model. CAIC performs favourably, showing
slight weaknesses in determining the true number of segments for higher sample
sizes, in comparison to ABIC and AIC3. Especially in the context of intermediate 
and near-uniform mixture proportions AIC3 performs well, quickly achieving high 
success rates. 

Continued research on the performance of model selection criteria is needed
in order to provide practical guidelines for disclosing the true number of segments
in a mixture and to guarantee accurate conclusions for marketing practice. In the
present study, only three combinations of mixture proportions were considered, but
as the results show that market characteristics (i.e. different segment sizes) affect
the performance of the criteria, future studies could allow for a greater variation of
these proportions. However, considering the high number of research projects, one
generally has to be critical of the idea of finding a unique measure that can be
considered optimal in every simulation design or even in practical applications, as
indicated in other studies. Model selection decisions should rather be based on various
sources of evidence, derived not only from the data at hand but also from theoretical
considerations.
Abstract. In this paper four local classification methods are described and their statistical
properties in the case of local data generating processes (LDGPs) are compared. In order to
systematically compare the local methods and LDA as a global standard technique, they are
applied to a variety of situations which are simulated by experimental design. This way, it is
possible to identify characteristics of the data that influence the classification performances of
individual methods. For the simulated data sets the local methods on average yield lower
error rates than LDA. Additionally, based on the estimated effects of the influencing factors,
groups of similar methods are found and the differences between these groups are revealed.
Furthermore, it is possible to recommend certain methods for special data structures.



1 Introduction 

We consider four local classification methods that all use the Bayes decision rule. 
The Common Components and the Hierarchical Mixture Classifiers, as well as Mix- 
ture Discriminant Analysis (MDA), are based on mixture models. In contrast, the 
Localized LDA (LLDA) relies on locally adaptive weighting of observations. Appli- 
cation of these methods can be beneficial in case of local data generating processes 
(LDGPs). That is, there is a finite number of sources where each one can produce 
data of several classes. The local data generation by individual processes can be de- 
scribed by local models. The LDGPs may cause, for example, a division of the data 
set at hand into several clusters containing data of one or more classes. For such 
data structures global standard methods may lead to poor results. One way to obtain 
more adequate methods is localization, which means to extend global methods for 
the purpose of local modeling. Both MDA and LLDA can be considered as localized 
versions of Linear Discriminant Analysis (LDA). 

In this paper we want to examine and compare some of the statistical properties of 
the four methods. These are questions of interest: Are the local methods appropriate 
to classification in case of LDGPs and do they perform better than global methods? 
Which data characteristics have a large impact on the classification performances 
and which methods are favorable for special data structures? For this purpose, in a
simulation study the local methods and LDA as a widely used global technique are
applied systematically to a large variety of situations generated and simulated by
experimental design.

This paper is organized as follows: First the four local classification methods are de- 
scribed and compared. In section 3 the simulation study and its results are presented. 
Finally, in section 4 a summary is given. 



2 Local classification methods 

2.1 Common Components Classifier - CC Classifier 

The CC Classifier (Titsias and Likas (2001)) constitutes an adaptation of a radial ba- 
sis function (RBF) network for class conditional density estimation with full sharing 
of kernels among classes. Miller and Uyar (1998) showed that the decision func- 
tion of this RBF Classifier is equivalent to the Bayes decision function of a classifier 
where class conditional densities are modeled by mixtures with common mixture 
components. 

Assume that there are $K$ given classes denoted by $c_1, \ldots, c_K$. Then in the common
components model the conditional density for class $c_k$ is

$$f_\theta(x \mid c_k) = \sum_{j=1}^{G_{CC}} \pi_{jk} f_{\theta_j}(x \mid j) \quad \text{for } k = 1, \ldots, K, \qquad (1)$$

where $\theta$ denotes the set of all parameters and $\pi_{jk}$ represents the probability $P(j \mid c_k)$.
The densities $f_{\theta_j}(x \mid j)$, $j = 1, \ldots, G_{CC}$, with $\theta_j$ denoting the corresponding
parameters, do not depend on $c_k$. Therefore all class conditional densities are explained by
the same $G_{CC}$ mixture components.

This implies that the data consist of $G_{CC}$ groups that can contain observations of
all $K$ classes. Because all data points in group $j$ are explained by the same density
$f_{\theta_j}(x \mid j)$, classes within a single group are badly separable. The CC Classifier can only
perform well if individual groups mainly contain data of a unique class. This is more
likely if the parameter $G_{CC}$ is large. Therefore the classification performance
depends heavily on the choice of $G_{CC}$.

In order to calculate the class posterior probabilities the parameters $\theta_j$ and the
priors $\pi_{jk}$ and $P_k := P(c_k)$ are estimated based on maximum likelihood and the EM
algorithm. Typically, $f_{\theta_j}(x \mid j)$ is a normal density with parameters $\theta_j = \{\mu_j, \Sigma_j\}$. A
derivation of the EM steps for the Gaussian case is given in Titsias and Likas (2001),
p. 989.
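
To make the consequence of shared components concrete, the class posteriors implied by
equation (1) and the Bayes rule can be sketched as follows. This is an illustrative
sketch only: it assumes already estimated parameters, uses univariate normal components
to stay dependency-free, and does not include the EM estimation. The function names and
the numbers in the example are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def cc_posteriors(x, class_priors, mix_weights, means, variances):
    """Class posteriors P(c_k | x) under the common components model.

    class_priors[k]   = P(c_k)
    mix_weights[k][j] = pi_jk = P(j | c_k)
    means[j], variances[j] describe component j, shared by all classes.
    """
    comp = [normal_pdf(x, m, v) for m, v in zip(means, variances)]
    joint = [
        p_k * sum(w * f for w, f in zip(weights_k, comp))
        for p_k, weights_k in zip(class_priors, mix_weights)
    ]
    total = sum(joint)
    return [value / total for value in joint]


# Two classes share two components centred at -2 and 2 (hypothetical values).
post = cc_posteriors(
    x=-2.0,
    class_priors=[0.5, 0.5],
    mix_weights=[[0.9, 0.1], [0.1, 0.9]],
    means=[-2.0, 2.0],
    variances=[1.0, 1.0],
)
```

Even at the centre of the first component the posterior of the first class stays at
about 0.9, because both classes are explained by the same density there. This is the
limited separability within a group described above.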

2.2 Hierarchical Mixture Classifier - HM Classifier 

The HM Classifier (Titsias and Likas (2002)) can be considered as an extension of the
CC Classifier. We assume again that the data consist of $G_{HM}$ groups. But additionally,
we suppose that within each group $j$, $j = 1, \ldots, G_{HM}$, there are class-labeled
subgroups that are modeled by the densities $f_{\theta_{kj}}(x \mid c_k, j)$ for $k = 1, \ldots, K$, where $\theta_{kj}$
are the corresponding parameters. Then the unconditional density of $x$ is given by a
three-level hierarchical mixture model

$$f_\theta(x) = \sum_{j=1}^{G_{HM}} \pi_j \sum_{k=1}^{K} P_{kj} f_{\theta_{kj}}(x \mid c_k, j) \qquad (2)$$

with $\pi_j$ representing the group prior probability $P(j)$ and $P_{kj}$ denoting the probability
$P(c_k \mid j)$. The class conditional densities take the form

$$f_{\theta_k}(x \mid c_k) = \sum_{j=1}^{G_{HM}} \pi_{jk} f_{\theta_{kj}}(x \mid c_k, j) \quad \text{for } k = 1, \ldots, K, \qquad (3)$$

where $\theta_k$ denotes the set of all parameters corresponding to class $c_k$. Here, the
mixture components $f_{\theta_{kj}}(x \mid c_k, j)$ depend on the class labels and hence each class
conditional density is described by a separate mixture. This resolves the data
representation drawback of the common components model.

The hierarchical structure of the model is maintained when calculating the class
posterior probabilities. In a first step, the group membership probabilities $P(j \mid x)$ are
estimated and, in a second step, based on $P(j \mid x)$, estimates for $\pi_j$, $P_{kj}$ and $\theta_{kj}$ are
computed. For calculating $P(j \mid x)$ the EM algorithm is used. Typically, $f_{\theta_{kj}}(x \mid c_k, j)$
is the density of a normal distribution with parameters $\theta_{kj} = \{\mu_{kj}, \Sigma_{kj}\}$. Details on
the EM steps in the Gaussian case can be found in Titsias and Likas (2002), p. 2230.
Note that the estimate of $\theta_{kj}$ is only provided if $P_{kj} \ne 0$. Otherwise, it is assumed that
group $j$ does not contain data of class $c_k$ and the associated subgroup is pruned.
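
For comparison with the common components model, the class posteriors that follow from
the hierarchical mixture density and the Bayes rule, P(c_k | x) proportional to the sum
over groups of P(j) P(c_k | j) f(x | c_k, j), can be sketched in the same way. The sketch
assumes fitted parameters, univariate normal subgroup densities and hypothetical names;
pruned subgroups are represented by a zero probability P(c_k | j) and are skipped. It
does not reproduce the two-step EM estimation of Titsias and Likas (2002).

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def hm_posteriors(x, group_priors, class_given_group, means, variances):
    """Class posteriors P(c_k | x) under the hierarchical mixture model.

    group_priors[j]         = pi_j = P(j)
    class_given_group[j][k] = P_kj = P(c_k | j)
    means[j][k], variances[j][k]: subgroup parameters, None if the subgroup is pruned.
    """
    n_classes = len(class_given_group[0])
    joint = [0.0] * n_classes
    for j, pi_j in enumerate(group_priors):
        for k in range(n_classes):
            p_kj = class_given_group[j][k]
            if p_kj == 0.0:
                continue  # pruned subgroup: group j holds no data of class k
            joint[k] += pi_j * p_kj * normal_pdf(x, means[j][k], variances[j][k])
    total = sum(joint)
    return [value / total for value in joint]


# Hypothetical fitted model: group 0 holds both classes, group 1 only class 0.
params = dict(
    group_priors=[0.5, 0.5],
    class_given_group=[[0.5, 0.5], [1.0, 0.0]],
    means=[[-3.0, -1.0], [4.0, None]],
    variances=[[1.0, 1.0], [1.0, None]],
)
post_left = hm_posteriors(-1.0, **params)
post_right = hm_posteriors(4.0, **params)
```

Because every class has its own density inside a group, the two classes in group 0 can
be separated, which the common components model cannot achieve within a single group.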

2.3 Mixture Discriminant Analysis - MDA 

MDA (Hastie and Tibshirani (1996)) is a localized form of Linear Discriminant Analysis
(LDA). Applying LDA is equivalent to using the Bayes rule in case of normal
populations with different means and a common covariance matrix. The approach
taken by MDA is to model the class conditional densities by Gaussian mixtures.
Suppose that each class $c_k$ is artificially divided into $S_k$ subclasses denoted by $c_{kj}$,
$j = 1, \ldots, S_k$, and define $S := \sum_{k=1}^{K} S_k$ as the total number of subclasses. The subclasses
are modeled by normal densities with different mean vectors $\mu_{kj}$ and, similar to LDA,
a common covariance matrix $\Sigma$. Then the class conditional densities are

$$f_{\mu_k,\Sigma}(x \mid c_k) = \sum_{j=1}^{S_k} \pi_{jk} \varphi_{\mu_{kj},\Sigma}(x \mid c_k, c_{kj}) \quad \text{for } k = 1, \ldots, K, \qquad (4)$$

where $\mu_k$ denotes the set of all subclass means in class $c_k$ and $\pi_{jk}$ represents the
probability $P(c_{kj} \mid c_k)$. The densities $\varphi_{\mu_{kj},\Sigma}(x \mid c_k, c_{kj})$ of the mixture components depend
on $c_k$. Hence, as in the case of the HM Classifier, the class conditional densities are
described by separate mixtures.

Parameters and priors are estimated based on maximum likelihood. In contrast to the
hierarchical approach taken by the HM Classifier, the MDA likelihood is maximized
directly using the EM algorithm.

Let $x \in \mathbb{R}^p$. LDA can be used as a tool for dimension reduction by choosing a
subspace of rank $p^* \le \min\{p, K - 1\}$ that maximally separates the class centers.
Hastie and Tibshirani (1996), p. 160, show that for MDA a dimension reduction
similar to LDA can be achieved by maximizing the log likelihood under the constraint
$\mathrm{rank}\{\mu_{kj}\} = p^*$ with $p^* \le \min\{p, S - 1\}$.

2.4 Localized LDA - LLDA 

The Localized LDA (Czogiel et al. (2006)) relies on an idea of Tutz and Binder
(2005). They suggest the introduction of locally adaptive weights to the training data
in order to turn global methods into observation-specific approaches that build
individual classification rules for all observations to be classified. Tutz and Binder
(2005) consider only two-class problems and focus on logistic regression. Czogiel et
al. (2006) extend their concept of localization to LDA by introducing weights to the
$n$ nearest neighbors $x_{(1)}, \ldots, x_{(n)}$ of the observation $x$ to be classified in the training
data set. These are given as

$$w(x, x_{(i)}) = W\left(\frac{\|x_{(i)} - x\|}{d_n(x)}\right) \qquad (5)$$

for $i = 1, \ldots, n$, with $W$ representing a kernel function. The Euclidean distance
$d_n(x) = \|x_{(n)} - x\|$ to the farthest neighbor denotes the kernel width. The
obtained weights are locally adaptive in the sense that they depend on the Euclidean
distances of $x$ and the training observations $x_{(i)}$.

Various kernel functions can be used. For the simulation study we choose the kernel
$W_\gamma(y) = \exp(-\gamma y)$ that was found to be robust against varying data characteristics by
Czogiel et al. (2006). The parameter $\gamma \in \mathbb{R}^+$ has to be optimized.

For each $x$ to be classified we obtain the $n$ nearest neighbors in the training data
and the corresponding weights $w(x, x_{(i)})$, $i = 1, \ldots, n$. These are used to compute
weighted estimates of the class priors, the class centers and the common covariance
matrix required to calculate the linear discriminant function. The relevant formulas
are given in Czogiel et al. (2006), p. 135.
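
The procedure for a single observation can be sketched as follows. This is a simplified
illustration, not the implementation of Czogiel et al. (2006): it handles only one
variable to stay dependency-free, it assumes weights of the form
W(|x_(i) - x| / d_n(x)) with the exponential kernel, and it assumes plain weighted
maximum likelihood estimates, since the exact formulas are not reproduced in the text.
The function name and the toy data are hypothetical.

```python
import math


def llda_classify(x, train_x, train_y, n_neighbors, gamma):
    """Classify a scalar x with a locally weighted LDA rule (one-variable sketch)."""
    # n nearest neighbours of x; in one dimension the Euclidean distance is |x_i - x|
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))
    nearest = nearest[:n_neighbors]
    width = abs(nearest[-1][0] - x)  # d_n(x): distance to the farthest neighbour
    if width == 0.0:
        width = 1.0  # assumption: all neighbours coincide with x, weights become 1
    weights = [math.exp(-gamma * abs(xi - x) / width) for xi, _ in nearest]

    total = sum(weights)
    classes = sorted({yi for _, yi in nearest})
    priors, means = {}, {}
    for c in classes:
        w_c = sum(w for w, (_, yi) in zip(weights, nearest) if yi == c)
        priors[c] = w_c / total  # weighted class prior
        means[c] = sum(w * xi for w, (xi, yi) in zip(weights, nearest) if yi == c) / w_c
    # weighted pooled variance, the one-variable analogue of the common covariance
    var = sum(w * (xi - means[yi]) ** 2 for w, (xi, yi) in zip(weights, nearest)) / total
    var = max(var, 1e-12)  # guard against a degenerate neighbourhood

    def score(c):
        # linear discriminant function of LDA for one variable
        return x * means[c] / var - means[c] ** 2 / (2.0 * var) + math.log(priors[c])

    return max(classes, key=score)


train_x = [0.0, 0.2, 0.4, 0.6, 5.0, 5.2, 5.4, 5.6]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
label_low = llda_classify(0.3, train_x, train_y, n_neighbors=6, gamma=5.0)
label_high = llda_classify(5.1, train_x, train_y, n_neighbors=6, gamma=5.0)
```

A separate rule is built for every observation to be classified, which is why the
method needs no assumption about groups but also requires the neighbour search and the
weighted estimates to be repeated for each new point.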



3 Simulation study 

3.1 Data generation, influencing factors and experimental design 

In this work we compare the local classification methods in the presence of local data 
generating processes (LDGPs). In order to simulate data for the case of K classes and 
M LDGPs we use the mixture model 

Table 1. The chosen levels, coded by -1 and 1, of the influencing factors on the classification
performances determine the data generating model (equation (6)). The factor PUVAR defines
the proportion of useless variables that have equal class means and hence do not contribute to
class separation.

influencing factor                                      model                        factor level
                                                                                     -1         +1
LP      number of LDGPs                                 $M$                          2          4
PLP     prior probabilities of LDGPs                    $\pi_j$                      unequal    equal
DLP     distance between LDGP centers                   $\mu_{kj}$                   large      small
CL      number of classes                               $K$                          3          6
PCL     (conditional) prior probabilities of classes    $P_{kj}$                     unequal    equal
DCL     distance between class centers                  $\mu_{kj}$                   large      small
VAR     number of variables                             $\mu_{kj}$, $\Sigma_{kj}$    4          12
PUVAR   proportion of useless variables                 $\mu_{kj}$                   0%         25%
DEP     dependency in the variables                     $\Sigma_{kj}$                no         yes
DND     deviation from the normal distribution          $T$                          no         yes

$$f_{\mu,\Sigma}(x) = \sum_{j=1}^{M} \pi_j \sum_{k=1}^{K} P_{kj} \, T\left[\varphi_{\mu_{kj},\Sigma_{kj}}(x \mid c_k, j)\right] \qquad (6)$$

with $\mu$ and $\Sigma$ denoting the sets of all $\mu_{kj}$ and $\Sigma_{kj}$, and priors $\pi_j$ and $P_{kj}$. The $j$-th LDGP
is described by the local model $\sum_{k=1}^{K} P_{kj} \, T[\varphi_{\mu_{kj},\Sigma_{kj}}(x \mid c_k, j)]$. The transformation
of the Gaussian mixture densities by the function $T$ makes it possible to produce data from
non-normal mixtures. In this work we use the system of densities by Johnson (1949) to
generate deviations from normality in skewness and kurtosis. If $T$ is the identity, the
data generating model equals the hierarchical mixture model in equation (2) with
Gaussian subgroup densities and $G_{HM} = M$.

We consider ten influencing factors which are given in Table 1. These factors
determine the data generating model. For example the factor PLP, defining the prior
probabilities of the LDGPs, is related to $\pi_j$ in equation (6) (cp. Table 1). We fix two
levels for every factor, coded by -1 and +1, which are also given in Table 1. In
general the low level is used for classification problems which should be of lower
difficulty, whereas the high level leads to situations where the premises of some
methods are not met (e.g. nonnormal mixture component densities) or the learning
problem is more complicated (e.g. more variables). For more details concerning the
choice of the factor levels see Schiffner (2006).

We use a fractional factorial $2^{10-3}$-design with tenfold replication leading to 1280
runs. For every run we construct a training data set with 3000 and a test data set
containing 1000 observations.
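
The run count follows from 2^(10-3) = 128 factor level combinations times ten
replications. A design of this size can be sketched as below. The generators, i.e. the
rule by which the last three factor columns are derived from the first seven, are
hypothetical placeholders; the generators actually used in the study are not stated in
the text.

```python
from itertools import product

FACTORS = ["LP", "PLP", "DLP", "CL", "PCL", "DCL", "VAR", "PUVAR", "DEP", "DND"]

# Hypothetical generators: each derived factor is the product of base-factor columns.
GENERATORS = {
    "PUVAR": ("LP", "PLP", "DLP", "CL"),
    "DEP": ("DLP", "CL", "PCL", "DCL"),
    "DND": ("LP", "PCL", "DCL", "VAR"),
}


def fractional_factorial(replications=10):
    """Return the replicated runs of a 2^(10-3) design as dicts of coded levels."""
    base = FACTORS[:7]
    runs = []
    for levels in product((-1, 1), repeat=len(base)):  # full factorial in 7 factors
        run = dict(zip(base, levels))
        for name, parents in GENERATORS.items():
            value = 1
            for parent in parents:
                value *= run[parent]
            run[name] = value
        runs.extend(dict(run) for _ in range(replications))
    return runs


runs = fractional_factorial()
```

Every factor appears equally often at both levels, so main effects can be estimated
from far fewer runs than the 1024 combinations of a full factorial design would need.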

3.2 Results 

We apply the local classification methods and global LDA to the simulated data sets
and obtain 1280 test data error rates $r_i$, $i = 1, \ldots, 1280$, for every method. The chosen
parameters, the group and subgroup densities assumed for the HM and CC Classifiers
and the resulting test data error rates are given in Table 2. The low Bayes errors
(cp. also Table 2) indicate that there are many easy classification problems. For the
data sets simulated in this study, in general, the local classification methods perform
much better than global LDA. An exception is the CC Classifier with $M$ groups,
CC M, which probably suffers from the common components assumption in
combination with the low number of groups. The HM Classifier is the most flexible of
the mixture based methods. The underlying model is met in all simulated situations
where deviations from normality do not occur. Probably for this reason the error rates
for the HM Classifier are lower than for MDA and the CC Classifiers.

Table 2. Bayes errors and error rates of all classification methods with the specified
parameters and mixture component densities on the 1280 simulated test data sets. $R^2$ denotes the
coefficients of determination for the linear regressions of the classification performances on
the influencing factors in Table 1.

method        parameters              mixture component                                     error rate                  $R^2$
                                      densities                                             minimum   mean    maximum
Bayes error   -                       -                                                     0.000     0.026   0.193     -
LDA           -                       -                                                     0.000     0.148   0.713     0.901
CC M          $G_{CC} = M$            $f_{\theta_j} = \varphi_{\mu_j,\Sigma_j}$             0.000     0.441   0.821     0.871
CC MK         $G_{CC} = M \cdot K$    $f_{\theta_j} = \varphi_{\mu_j,\Sigma_j}$             0.000     0.054   0.217     0.801
LLDA          $\gamma = 5$, $n = 500$ -                                                     0.000     0.031   0.207     0.869
MDA           $S_k = M$               -                                                     0.000     0.042   0.205     0.904
HM            $G_{HM} = M$            $f_{\theta_{kj}} = \varphi_{\mu_{kj},\Sigma_{kj}}$    0.000     0.036   0.202     0.892

In order to measure the influence of the factors in Table 1 on the classification
performances of all methods we estimate their main and interaction effects by linear
regressions of $\ln(\mathrm{odds}(1 - r_i)) = \ln((1 - r_i)/r_i) \in \mathbb{R}$, $i = 1, \ldots, 1280$, on the coded
factors. Then an estimated effect of 1, e.g. of factor DND, can be interpreted as an
increase in the ratio of hit rate to error rate by the factor $e \approx 2.7$.
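
The transformation and the estimation of main effects can be sketched as follows. The
sketch makes two assumptions that are not taken from the text: error rates are clipped
away from 0 and 1 so that the log odds stay finite, and the design is orthogonal and
balanced, in which case the least squares coefficient of a coded factor reduces to the
average of factor level times response. The data in the example are hypothetical.

```python
import math


def log_odds_hit(error_rate, eps=1e-4):
    """ln((1 - r) / r), with r clipped to [eps, 1 - eps] (assumption of this sketch)."""
    r = min(max(error_rate, eps), 1.0 - eps)
    return math.log((1.0 - r) / r)


def main_effects(design, error_rates):
    """Least squares coefficients of the coded factors for an orthogonal -1/+1 design."""
    y = [log_odds_hit(r) for r in error_rates]
    n_runs = len(y)
    n_factors = len(design[0])
    return [
        sum(row[j] * y_i for row, y_i in zip(design, y)) / n_runs
        for j in range(n_factors)
    ]


# Hypothetical 2^2 design whose log odds equal 0.5 + 1.0 * x1 - 0.5 * x2.
design = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
rates = [1.0 / (1.0 + math.exp(0.5 + 1.0 * a - 0.5 * b)) for a, b in design]
effects = main_effects(design, rates)
odds_factor = math.exp(effects[0])
```

In this toy example the coefficient of the first factor is 1, so a unit increase of the
coded factor multiplies the odds of a correct classification by about 2.7.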

The coefficients of determination, $R^2$, indicate a good fit of the linear models for
all classification methods (cp. Table 2), hence the estimated factor effects are
meaningful. The estimated main effects are shown in Figure 1. For the most important
factors CL, DCL and VAR they indicate that a small number of classes, a big distance
between the class centers and a high number of variables improve the classification
performances of all methods.

To assess which classification methods react similarly to changes in data
characteristics they are clustered based on the Euclidean distances in their estimated main
and interaction effects. The resulting dendrogram in Figure 2 shows that one group
is formed by the HM Classifier, MDA and LLDA, which also exhibit similarities in
their theoretical backgrounds. In the second group there are global LDA and the
local CC Classifier with MK groups, CC MK. The factors mainly revealing differences
between CC M, which is isolated in the dendrogram, and the remaining methods are
CL, DCL, VAR and LP (cp. Figure 1). For the first three factors the absolute effects
for CC M are much smaller. Additionally, CC M is the only method with a positive
estimated effect of LP, the number of LDGPs, which probably indicates that a larger
number of groups improves the classification performance (cp. the error rates of CC
MK in Table 2). The factor DLP reveals differences between the two groups found
in the dendrogram. In contrast to the remaining methods, for both CC Classifiers
as well as LDA small distances between the LDGP centers are advantageous. Local
modeling is less necessary if the LDGP centers for individual classes are close
together, and hence the global and common components based methods perform better
than in other cases.

Fig. 1. Estimated main effects of the influencing factors in Table 1 on the classification
performances of all methods

Fig. 2. Hierarchical clustering of the classification methods using average linkage based
on the estimated factor effects

Based on theoretical considerations, the estimated factor effects and the test data
error rates, we can assess which methods are favorable in some special situations. The
estimated effects of factor LP and the error rates in Table 2 show that application of
the CC Classifier can be disadvantageous and is only beneficial in conjunction with
a large number of groups $G_{CC}$ which, however, can make the interpretation of the
results very difficult. Moreover, for large $M$, problems in the E step of the classical EM
algorithm can occur for the CC and the HM Classifiers in the Gaussian case due to
singular estimated covariance matrices. Hence, in situations with a large number of
LDGPs MDA can be favorable because it yields low error rates and is insensitive to
changes of $M$ (cp. Figure 1), probably thanks to the assumption of a common
covariance matrix and dimension reduction.

A drawback of MDA is that the numbers of subclasses for all K classes have to be 
specified in advance. Because of subgroup-pruning for the HM Classifier only one 
parameter Ghm has to be fixed. 

If deviations from normality occur in the mixture components LLDA can be
recommended since, as for CC M, the estimated effect of DND is nearly zero and the test
data error rates are very small. In contrast to the mixture based methods it is
applicable to data of every structure because it does not assume the presence of groups,
subgroups or subclasses. On the other hand, for this reason, the results of LLDA are
less interpretable.



4 Summary 

In this paper different types of local classification methods, based on mixture models
or locally adaptive weighting, are compared in the case of LDGPs. For the mixture
models we can distinguish the common components and the separate mixtures approach.
In general the four local methods considered in this work are appropriate for
classification problems in the case of LDGPs and perform much better than global LDA on
the simulated data sets. However, the common components assumption in conjunction
with a low number of groups has been found very disadvantageous. The most
important factors influencing the performances of all methods are the numbers of
classes and variables as well as the distances between the class centers. Based on all
estimated factor effects we identified two groups of similar methods. The differences
are mainly revealed by the factors LP and DLP, both related to the LDGPs. For a
large number of LDGPs MDA can be recommended. If the mixture components are
not Gaussian, LLDA appears to be a good choice. Future work could consider
robust versions of the compared methods that can better deal, for example,
with deviations from normality.
Abstract. Astronomy is in the age of large scale surveys in which the gathering of
multidimensional data on thousands of millions of objects is now routine. Efficiently processing these
data - classifying objects, searching for structure, fitting astrophysical models - is a significant
conceptual (not to mention computational) challenge. While standard statistical methods, such
as Bayesian clustering, k-nearest neighbours, neural networks and support vector machines,
have been successfully applied to some areas of astronomy, it is often difficult to incorporate
domain specific information into these. For example, in astronomy we often have good
physical models for the objects (e.g. stars) we observe. That is, we can reasonably well predict
the observables (typically, the stellar spectrum or colours) from the astrophysical parameters
(APs) we want to infer (such as mass, age and chemical composition). This is the “forward
model”. The task of classification or parameter estimation is then an inverse problem. In this
paper, we discuss the particular problem of combining astrometric information, effectively a
measure of the distance of the source, with spectroscopic information.



1 Introduction 

Gaia is an ESA astronomical satellite that will be launched in 2011. Its mission is 
to build a three dimensional map of the positions and velocities of a substantial part 
of our galaxy. In addition to the basic position and velocity data, the astrophysical 
nature of the detected objects will be determined. Since Gaia is expected to detect 
upwards of a billion individual objects of various types, and since the mission will 
not use an input catalogue, automated classification and parameterization based on 
the dataset is a crucial part of the mission. 

1.1 Astronomical context 

From galactic rotation curves and other evidence it is believed that most material in 
the universe is comprised of so-called dark matter. The nature of this material is a 
fundamental current question in astronomy. The distribution and properties of the 
dark matter at the time of the formation of our galaxy should leave traces in the dis- 
tribution and dynamics of the stellar population that is observed today. Since heavy 
elements are formed by nucleosynthesis in the centres of massive stars, and are
therefore scarce at early epochs, their relative abundances in stellar atmospheres can be
used to discriminate between stellar populations on the basis of age. By building up 
a complete picture of a large portion of our galaxy, such tracers of galactic evolution 
can be studied in unprecedented detail. 

1.2 Basic properties of the dataset 

Gaia will detect all point sources down to a fixed limiting brightness. This limit
corresponds to the brightness of the Sun if observed at a distance of approximately 11,000
parsecs (35,000 light years, compared with the accepted distance to the Galactic centre of
26,000 light years). The vast majority of detected sources will be stars, but the
sample will also include several million galaxies and quasars, which are extragalactic
objects, and many objects from within our own solar system.

The positions of the various sources on the sky can of course be measured very
easily. Radial velocities are determined from Doppler shifts of spectral lines observed
with an onboard spectrometer. Transverse motions on the sky are of the order of a few
milliarcseconds per year, scaling with distance, and these motions must be mapped
over the timescale of the mission (5-6 years). Distances are a priori not known and
are in fact one of the most difficult, and most crucially important, measurements in
astronomy. Gaia is designed to measure the parallaxes of the stellar sources in order
to determine distances to nearby stars. The parallax in question is the result of the
changing viewpoint of the satellite as the Earth orbits the Sun. An object displaying
a parallax of one arcsecond relative to distant, negligible-parallax stars has by
definition a distance of 1 parsec (3.26 light years). This distance happens to correspond
roughly to the distance to the nearest stars. The parallax scales inversely with distance,
so that the Sun at a distance of 11,000 parsec (the approximate brightness limit of
such an object for Gaia) would display a parallax of about 90 microarcseconds (μas).
Gaia is designed to measure parallaxes with a standard error of around 25 μas, so that
the parallax-limit roughly corresponds to the brightness-limit for solar type stars.
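
The relation between parallax and distance used in this paragraph can be checked with
a few lines of code. The sketch below only encodes the definition of the parsec (a
parallax of one arcsecond corresponds to a distance of one parsec, and the parallax is
inversely proportional to the distance); the function names are hypothetical.

```python
MICROARCSEC_PER_ARCSEC = 1.0e6


def parallax_microarcsec(distance_parsec):
    """Parallax in microarcseconds of a source at the given distance in parsec."""
    return MICROARCSEC_PER_ARCSEC / distance_parsec


def distance_parsec(parallax_muas):
    """Distance in parsec of a source with the given parallax in microarcseconds."""
    return MICROARCSEC_PER_ARCSEC / parallax_muas


limit_parallax = parallax_microarcsec(11000.0)  # Sun-like star at the brightness limit
signal_to_noise = limit_parallax / 25.0         # relative to the quoted standard error
```

At 11,000 parsec the parallax is about 91 μas, i.e. between three and four times the
quoted standard error of 25 μas.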

As well as position, parallax and transverse motion (proper motion), and the high 
resolution spectra used to determine the radial velocities, the Gaia satellite will return 
low resolution spectra with approximately 96 resolution elements spanning the range 
300-1000 nanometres (roughly from the ultraviolet to the near infrared range). These 
spectra can be used to classify objects according to basic type (galaxies, quasars,
stars etc) and to then determine the basic parameters of the object (e.g. for stars, the 
effective temperature of the atmosphere). This information is important because the 
nature of the stellar population coupled with the kinematic information constrains 
models of galaxy formation and evolution. 



2 Classification and parametrization 

As the sky is continuously scanned by the satellite’s detectors, sources are detected
on board and the data (position, low resolution spectra and high resolution spectrum)
are extracted from the raw detector output, processed into an efficient form and
returned to the ground station. As the mission proceeds, repeated visits to the same
area of sky allow the measurement of variations in the positions of sources, which
are used to build up a model of the proper motions and parallaxes for the full set
of sources. This leads to a distinction for the data processing between early mission
data, consisting of the spectra and positions, and late mission data, which includes
parallaxes and proper motions. The sources should be classified into broad
astronomical classes on the basis of the spectra alone in the early mission, and on the basis of
the spectra combined with astrometric information in the later part of the mission.
This classification is important for the astrophysics, but also for the astrometric
solution, since the quasars form a distant, essentially fixed (zero parallax and
zero proper motion, plus or minus measurement errors) population. The early
mission classifier should feed back the identified extragalactic objects to the astrometric
processing, and the purer this sample, the better.

Once the classification is made, sources are fitted with astrophysical models to 
recover various parameters, such as effective surface temperature or atmospheric el- 
ement abundances for stars. The algorithms for this classification and regression are 
in the early stages of development by the data processing consortium. For the classi- 
fication, the algorithm mostly used at this stage is a Support Vector Machine (SVM) 
after Vapnik (1995), taken from the library libSVM (Chang and Lin (2001)), with a 
radial basis function (RBF) kernel. The decision to use SVM for classification is of 
course provisional and other methods may be considered. Synthetic data for training 
and testing the classifier is produced using standard models of various astronomical 
source classes. The multi-class SVM used returns a probability vector containing the 
probabilities that a particular source belongs to each class (Wu and Weng (2005)). 
Sources are classified according to the highest component of the probability vector. 
We are now incorporating into the simulated data values for the parallax and proper 
motion, indicating a distance. The current task is to incorporate this information into 
the classification and regression schemes. 



3 Classification results 

For current purposes, we consider only four classes of astrophysical object: single
stars and binary stars, both of which belong to the set of objects within our own 
galaxy, and galaxies and quasars, both of which are extragalactic. Two datasets were 
generated, each with a total of 5000 sources split evenly between the four classes 
(i.e. 1250 of each). One set was used as a training set for the SVM; the other is a
test set from which the statistics are compiled. The classification results for the basic
SVM classifier running on the spectrum only are shown in Table 1. Here, and in 
subsequent experiments, the input data are scaled to have mean of zero and standard 
deviation of one for each bin. The classifier achieves an overall correct classification 
rate of approximately 93%. The main confusion is between single stars and binaries. 
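As a check on the quoted figure, the overall rate can be recomputed from the diagonal of Table 1. The following is a minimal sketch only: the function names are illustrative and not part of the consortium software, and the mean of the diagonal equals the overall correct classification rate only under the assumption, stated in the text, that the four classes are equally populated in the test set.

```python
# Row-normalised confusion matrix from Table 1 (percentages; rows = true class).
CLASSES = ["Stars", "Binaries", "Quasars", "Galaxies"]
TABLE_1 = [
    [88.21, 9.27, 2.43, 0.09],
    [8.67, 91.13, 0.00, 0.20],
    [2.04, 0.90, 95.77, 1.28],
    [0.00, 0.00, 0.62, 99.38],
]


def overall_rate(confusion):
    """Mean of the diagonal; the overall rate only if classes are equally sized."""
    return sum(confusion[i][i] for i in range(len(confusion))) / len(confusion)


def largest_confusion(confusion, classes):
    """Return (true class, assigned class, percentage) of the largest off-diagonal entry."""
    best = None
    for i, row in enumerate(confusion):
        for j, value in enumerate(row):
            if i != j and (best is None or value > best[2]):
                best = (classes[i], classes[j], value)
    return best


print(round(overall_rate(TABLE_1), 2))
print(largest_confusion(TABLE_1, CLASSES))
```

The largest off-diagonal entry lies between the star and binary classes, in line with the remark on the main confusion.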

The parallaxes of the simulated data for stars and quasars are shown in Figure 1.
The parallax could be included directly into the classifier as a 97th data point for each
object, alongside the 96 spectral bins. Such a classifier would be expected to perform
significantly better than the spectrum-only version, and indeed it does (Table 2). It might,
however, be possible to include the parallax in the classification in a way that utilises
our knowledge of the astrophysical significance of the quantity. Significant values of
parallax are expected for a subset of the galactic sources, i.e. the stars and binaries.
Not all stars and binaries will have a detectable parallax, but none of the extragalactic
sources will. This then suggests a split in the data, based on parallax, into objects that
are certainly galactic and objects that may belong to any class.



Table 1. Confusion matrix for the SVM classifier, working on the spectral data without any
astrometric information. Reading row by row, the matrix shows the percentage of test sources
which are of a particular type, for example Stars, which are classified as each possible output
type. The leading diagonal shows the sources that are correctly classified (true positives). The
off-diagonal elements show the level of contamination (false positives) as a percentage of the
input source sample. In this test case, the numbers of each class of source were roughly equal
(just over 1000 each). In the real mission, the number of stars is expected to be three orders of
magnitude greater than the number of galaxies or quasars.

            Stars      Binaries   Quasars    Galaxies
Stars       88.21      9.27       2.43       0.09
Binaries    8.67       91.13      0.00       0.20
Quasars     2.04       0.90       95.77      1.28
Galaxies    0.00       0.00       0.62       99.38



Fig. 1. The distribution of simulated parallaxes for stars (filled squares) and quasars (+ signs).
Horizontal axis: log(parallax (mas)).

Table 2. Confusion matrix obtained by using the SVM with the parallax information included
as an additional input.

            Stars      Binaries   Quasars    Galaxies
Stars       93.52      6.03       0.45       0.00
Binaries    6.38       93.62      0.00       0.00
Quasars     0.76       0.14       98.91      0.19
Galaxies    0.00       0.00       0.41       99.59



To implement such a two-stage classifier, we trained two separate SVMs, one
with all four classes, and the other with the galactic sources (stars and binaries) only.
These SVMs were trained on the spectral data only, not including the parallax. We
then classified the entire test set with each classifier. For each object, the output from
each classifier is a four-component probability vector; in the case of the classifier
trained only on galactic sources (stars and binaries), the probabilities for the quasars
and galaxies are necessarily always zero. Finally, we combined the output probability
vectors of the two SVMs using a weighting function based on the parallax value.

If $P_1$ and $P_2$ are the probability vectors for the galactic and general SVM classifier
respectively, they are combined to form the output probability as follows:

$P = w P_1 + (1 - w) P_2$, (1)

$w = 0.5\,(1 + \tanh(a \cdot \mathrm{SNR} + b))$, (2)

where SNR is the significance of the measured parallax, estimated by assuming that
the standard error is 25 μas. The parameter $a$ is set to 1 and the value of $b$ to -5. With
these values, the function does not produce significant weighting ($w \approx 0.1$) toward
exclusively galactic sources until the parallax rises to four times the standard error.
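The weighting scheme of equations (1) and (2) can be illustrated with a short sketch. It is not the code used for the experiments: the function names and the example probability vectors are hypothetical, the offset parameter is written `b` here, and the standard error of 25 μas, the slope 1 and the offset -5 are the values quoted in the text.

```python
import math


def parallax_weight(parallax_uas, sigma_uas=25.0, a=1.0, b=-5.0):
    """Weight w of equation (2); SNR is the parallax divided by its standard error."""
    snr = parallax_uas / sigma_uas
    return 0.5 * (1.0 + math.tanh(a * snr + b))


def combine(p_galactic, p_general, w):
    """Equation (1): convex combination of the two probability vectors."""
    return [w * g + (1.0 - w) * q for g, q in zip(p_galactic, p_general)]


# Hypothetical class probabilities in the order stars, binaries, quasars, galaxies.
p_gal = [0.6, 0.4, 0.0, 0.0]   # galactic-only SVM: extragalactic entries are zero
p_gen = [0.3, 0.2, 0.4, 0.1]   # four-class SVM

for parallax in (0.0, 100.0, 125.0, 300.0):
    w = parallax_weight(parallax)
    print(parallax, round(w, 4), [round(x, 3) for x in combine(p_gal, p_gen, w)])
```

Because the combination is convex, the output remains a probability vector whenever both inputs are. At four times the standard error (100 μas) the weight is about 0.12, which agrees with the value of roughly 0.1 quoted above.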



Fig. 2. The weighting function applied to the extragalactic sources. Panel title: Extragalactic sources; horizontal axis: Parallax (μas).



The results of the two-stage classification are shown in Table 3. The leading
diagonal shows that the completeness for each class is not as good as in the case of
the single SVM classifier with parallax discussed above (Table 2); however, the
contamination of the extragalactic sources with misidentified galactic sources has
been strongly reduced, in fact falling to zero for the test sample of 5000 objects.
As noted above, this is a significant advantage when the galaxies and quasars form
important classes for determining the astrometric solution, and when there will be
several hundred times more stars than extragalactic objects in the final sample.



Fig. 3. The weighting function for the galactic sources. These sources are distributed through
a range of parallaxes. Panel title: Galactic sources; horizontal axis: Parallax (μas).



Table 3. Confusion matrix obtained by using the SVM twice then combining the probabilities
weighted according to the value of the parallax.

            Stars      Binaries   Quasars    Galaxies
Stars       90.82      9.18       0.00       0.00
Binaries    8.87       91.13      0.00       0.00
Quasars     2.04       0.90       95.77      1.28
Galaxies    0.00       0.00       0.62       99.38



4 Summary 

Since we know the relationship of the observables to the underlying nature of the 
objects in the sample, we are in a position to incorporate this knowledge into the 
classification or regression problems in an informed way, making maximum use of 
this physical knowledge. The goal of this is twofold. First, the addition of domain-specific
information should improve the predictive accuracy. Second, and no less important,
it allows an interpretation of how the model works: the sensitivities
of the model observables to a given underlying parameter provide an explicit (and
unique) weighting function of the observables. Apart from making the model more
acceptable (and less like a “black box”), this allows us to identify where we should gather
higher quality data in order to improve performance further.



Abstract. A proposal of an extended version of the HINoV method for the identification of
the noisy variables (Carmone et al. (1999)) for nonmetric, mixed, and symbolic interval data is 
presented in this paper. Proposed modifications are evaluated on simulated data from a variety 
of models. The models contain the known structure of clusters. In addition, the models contain 
a different number of noisy (irrelevant) variables added to obscure the underlying structure to 
be recovered. 



1 Introduction 

Choosing variables is one of the most important steps in a cluster analysis. Variables
used in applied clustering should be selected and weighted carefully. In a cluster
analysis we should include only those variables that are believed to help to discriminate
the data (Milligan (1996), p. 348). Two classes of approaches to choosing
the variables for cluster analysis can facilitate cluster recovery in the data (e.g.
Gnanadesikan et al. (1995); Milligan (1996), pp. 347-352):

- variable selection (selecting a subset of relevant variables), 

- variable weighting (introducing relative importance of the variables according 
to their weights). 

Carmone et al. (1999) discussed the literature on the variable selection and 
weighting (the characteristics of six methods and their limitations) and proposed the 
HINoV method for the identification of the noisy variables, in the area of the variable 
selection, to remedy problems with these methods. They demonstrated its robustness 
with metric data and the k-means algorithm. The authors suggest further studies of the
HINoV method with different types of data and other clustering algorithms on p. 
508. 

In this paper we propose an extended version of the HINoV method for nonmetric,
mixed, and symbolic interval data. The proposed modifications are evaluated for
eight clustering algorithms on simulated data from a variety of models.



2 Characteristics of the HINoV method and its modifications

The algorithm of the Heuristic Identification of Noisy Variables (HINoV) method for metric
data (Carmone et al. (1999)) is as follows:

1. A data matrix $[x_{ij}]$ containing n objects and m normalized variables measured
on a metric scale ($i = 1, \dots, n$; $j = 1, \dots, m$) is a starting point.

2. Cluster, via the kmeans method, the observed data separately for each j-th variable
for a given number of clusters u. It is possible to use clustering methods based on
a distance matrix (pam or any hierarchical agglomerative method: single, complete,
average, mcquitty, median, centroid, Ward).

3. Calculate adjusted Rand indices $R_{jl}$ ($j, l = 1, \dots, m$) for partitions formed from
all distinct pairs of the m variables ($j \neq l$). Because the adjusted Rand index is
symmetric, we need to calculate $m(m-1)/2$ values.

4. Construct the $m \times m$ adjusted Rand matrix (parim). Sum rows or columns for each
j-th variable, $R_{j\bullet} = \sum_{l=1}^{m} R_{jl}$ (topri):

$$
\begin{array}{c|cccc|c}
\text{Variable} & \multicolumn{4}{c|}{\text{parim}} & \text{topri} \\
\hline
M_1 &        & R_{12} & \cdots & R_{1m} & R_{1\bullet} \\
M_2 & R_{21} &        & \cdots & R_{2m} & R_{2\bullet} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
M_m & R_{m1} & R_{m2} & \cdots &        & R_{m\bullet}
\end{array}
$$

5. Rank topri values $R_{1\bullet}, R_{2\bullet}, \dots, R_{m\bullet}$ in decreasing order (stopri) and plot the
scree diagram. The size of a topri value indicates the contribution of that variable to
the cluster structure. A scree diagram identifies sharp changes in the topri values.
Relatively low-valued topri variables (the noisy variables) are identified and eliminated
from further analysis (say h variables).

6. Run a cluster analysis (based on the same classification method) with the selected
$m - h$ variables.

The modification of the HINoV method for nonmetric data (where the number of objects
is much larger than the number of categories) differs in steps 1, 2, and 6 (Walesiak
(2005)):

1. A data matrix $[x_{ij}]$ containing n objects and m ordinal and/or nominal variables
is a starting point.

2. For each j-th variable we receive natural clusters, where the number of clusters
equals the number of categories for that variable (for instance five for Likert scale or
seven for semantic differential scale).

6. Run a cluster analysis with one of the clustering methods based on a distance
appropriate to nonmetric data (GDM2 for ordinal data - see Jajuga et al. (2003);
Sokal and Michener distance for nominal data) with the selected $m - h$ variables.
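Steps 2 to 5 of the nonmetric variant lend themselves to a compact illustration, because the categories of each variable already define its partition. The sketch below is not the clusterSim implementation: the function names are hypothetical, the diagonal of parim is left out of the row sums (an assumption), and the value returned for degenerate partitions is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert and Arabie's adjusted Rand index for two partitions of n >= 2 objects."""
    n = len(labels_a)
    cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = rows * cols / comb(n, 2)
    maximum = 0.5 * (rows + cols)
    if maximum == expected:
        return 1.0  # degenerate case (e.g. both partitions trivial); convention only
    return (cells - expected) / (maximum - expected)


def topri(columns):
    """Row sums of the adjusted Rand matrix (parim), diagonal excluded by assumption."""
    m = len(columns)
    parim = [[0.0] * m for _ in range(m)]
    for j in range(m):
        for l in range(j + 1, m):
            parim[j][l] = parim[l][j] = adjusted_rand_index(columns[j], columns[l])
    return [sum(row) for row in parim]


# Two variables sharing one cluster structure (up to relabelling) and one noisy variable.
v1 = [1, 1, 1, 2, 2, 2, 3, 3, 3]
v2 = [3, 3, 3, 1, 1, 1, 2, 2, 2]
v3 = [1, 2, 3, 1, 2, 3, 1, 2, 3]
scores = topri([v1, v2, v3])
stopri = sorted(range(len(scores)), key=lambda j: -scores[j])  # decreasing order
print(scores, stopri)
```

In this toy input the third variable receives the lowest topri value and would be the candidate for elimination in step 5.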

The modification of the HINoV method for symbolic interval data differs in steps
1 and 2:

1. A symbolic data array containing n objects and m symbolic interval variables
is a starting point.

2. Cluster the observed data with one of the clustering methods (pam, single, complete,
average, mcquitty, median, centroid, Ward) based on a distance appropriate to
the symbolic interval data (e.g. Hausdorff distance - see Billard and Diday (2006),
p. 246) separately for each j-th variable for a given number of clusters u.
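For a single interval variable, the Hausdorff distance between two closed intervals reduces to the larger of the two endpoint differences. The following sketch shows the per-variable distance matrix that a method such as pam would take as input in step 2; the function names are illustrative, and aggregation over several variables (treated in Billard and Diday (2006)) is deliberately left out.

```python
def hausdorff_interval(x, y):
    """Hausdorff distance between closed intervals x = (a1, b1) and y = (a2, b2)."""
    (a1, b1), (a2, b2) = x, y
    return max(abs(a1 - a2), abs(b1 - b2))


def distance_matrix(intervals):
    """Symmetric matrix of pairwise Hausdorff distances for one interval variable."""
    return [[hausdorff_interval(x, y) for y in intervals] for x in intervals]


for row in distance_matrix([(0, 1), (2, 5), (0, 10)]):
    print(row)
```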

The functions HINoV.Mod and HINoV.Symbolic of the clusterSim computer program
working in R handle mixed (metric, nonmetric) data and symbolic
interval data, respectively. The proposed modifications of the HINoV method are evaluated on
simulated data from a variety of models.



3 Simulation models 

We generate data sets in eleven different scenarios. The models contain the known 
structure of clusters. In the models 2-11 the noisy variables are simulated indepen- 
dently from the uniform distribution. 

Model 1. No cluster structure. 200 observations are simulated from the uniform
distribution over the unit hypercube in 10 dimensions (see Tibshirani et al. (2001), p.
418).

Model 2. Two elongated clusters in 5 dimensions (3 noisy variables). Each cluster
contains 50 observations. The observations in each of the two clusters are independent
bivariate normal random variables with means (0, 0), (1, 5), and covariance
matrix $\Sigma$ ($\sigma_{jj} = 1$, $\sigma_{jl} = -0.9$).

Model 3. Three elongated clusters in 7 dimensions (5 noisy variables). Each
cluster is randomly chosen to have 60, 30, 30 observations, and the observations are
independently drawn from bivariate normal distribution with means (0, 0), (1.5, 7),
(3, 14) and covariance matrix $\Sigma$ ($\sigma_{jj} = 1$, $\sigma_{jl} = -0.9$).

Model 4. Three elongated clusters in 10 dimensions (7 noisy variables). Each
cluster is randomly chosen to have 70, 35, 35 observations, and the observations
are independently drawn from multivariate normal distribution with means (1.5, 6,
-3), (3, 12, -6), (4.5, 18, -9), and identical covariance matrix $\Sigma$, where $\sigma_{jj} = 1$
($1 \le j \le 3$), $\sigma_{12} = \sigma_{13} = -0.9$, and $\sigma_{23} = 0.9$.

Model 5. Five clusters in 3 dimensions that are not well separated (1 noisy variable).
Each cluster contains 25 observations. The observations are independently
drawn from bivariate normal distribution with means (5, 5), (-3, 3), (3, -3), (0, 0),
(-5, -5), and identical covariance matrix $\Sigma$ ($\sigma_{jj} = 1$, $\sigma_{jl} = 0.9$).

Model 6. Five clusters in 5 dimensions that are not well separated (2 noisy variables).
Each cluster contains 30 observations. The observations are independently
drawn from multivariate normal distribution with means (5, 5, 5), (-3, 3, -3), (3, -3,
3), (0, 0, 0), (-5, -5, -5), and covariance matrix $\Sigma$, where $\sigma_{jj} = 1$ ($1 \le j \le 3$), and
$\sigma_{jl} = 0.9$ ($1 \le j \neq l \le 3$).

Model 7. Five clusters in 10 dimensions (8 noisy variables). Each cluster is randomly
chosen to have 50, 20, 20, 20, 20 observations, and the observations are independently
drawn from bivariate normal distribution with means (0, 0), (0, 10), (5, 5),
(10, 0), (10, 10), and identity covariance matrix $\Sigma$ ($\sigma_{jj} = 1$, $\sigma_{jl} = 0$).

Model 8. Five clusters in 9 dimensions (6 noisy variables). Each cluster contains
30 observations. The observations are independently drawn from multivariate normal
distribution with means (0, 0, 0), (10, 10, 10), (-10, -10, -10), (10, -10, 10), (-10,
10, 10), and identical covariance matrix $\Sigma$, where $\sigma_{jj} = 3$ ($1 \le j \le 3$), and $\sigma_{jl} = 2$
($1 \le j \neq l \le 3$).

Model 9. Four clusters in 6 dimensions (4 noisy variables). Each cluster is randomly
chosen to have 50, 50, 25, 25 observations, and the observations are independently
drawn from bivariate normal distribution with means (-4, 5), (5, 14), (14, 5),
(5, -4), and identity covariance matrix $\Sigma$ ($\sigma_{jj} = 1$, $\sigma_{jl} = 0$).

Model 10. Four clusters in 12 dimensions (9 noisy variables). Each cluster contains
30 observations. The observations are independently drawn from multivariate
normal distribution with means (-4, 5, -4), (5, 14, 5), (14, 5, 14), (5, -4, 5), and identity
covariance matrix $\Sigma$, where $\sigma_{jj} = 1$ ($1 \le j \le 3$), and $\sigma_{jl} = 0$ ($1 \le j \neq l \le 3$).

Model 11. Four clusters in 10 dimensions (9 noisy variables). Each cluster contains
35 observations. The observations on the first variable are independently drawn
from univariate normal distribution with means -2, 4, 10, 16 respectively, and identical
variance $\sigma_j^2 = 0.5$ ($1 \le j \le 4$).

Ordinal data. The clusters in models 1-11 contain continuous data and a discretization
process is performed on each variable to obtain ordinal data. The number
of categories $k$ determines the width of each class interval:
$[\max_i\{x_{ij}\} - \min_i\{x_{ij}\}]/k$. Independently for each variable, each class interval
receives a category $1, \dots, k$ and the actual value of variable $x_{ij}$ is replaced by these
categories. In the simulation study $k = 5$ (for $k = 7$ we have received similar results).
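The discretization described above amounts to equal-width binning of each variable. A minimal sketch follows; the function name is hypothetical, and treating the class intervals as closed on the left (with the maximum assigned to category k) is an assumption, since the text does not state how boundary values are handled.

```python
def discretize(values, k=5):
    """Replace each value by its equal-width category 1..k (left-closed intervals assumed)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [1] * len(values)  # constant variable: a single category
    # The maximum would fall into category k + 1, so it is clipped to k.
    return [min(int((v - lo) / width) + 1, k) for v in values]


print(discretize([0.0, 1.0, 2.5, 4.9, 5.0, 10.0], k=5))  # [1, 1, 2, 3, 3, 5]
```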

Symbolic interval data. To obtain symbolic interval data the data were generated
for each model twice into sets A and B and the minimal (maximal) value of $\{a_{ij}, b_{ij}\}$ is
treated as the beginning (the end) of an interval.
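The construction of the intervals can be written in a few lines. This is only a sketch of the rule stated above, with a hypothetical function name and made-up example values; the two inputs stand for the two independently generated data sets A and B of equal shape.

```python
def to_intervals(set_a, set_b):
    """Cell-wise intervals (min(a_ij, b_ij), max(a_ij, b_ij)) from two generated data sets."""
    return [
        [(min(a, b), max(a, b)) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(set_a, set_b)
    ]


A = [[1.0, 5.0], [0.2, -3.0]]
B = [[2.0, 3.0], [0.1, -1.0]]
print(to_intervals(A, B))
```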

Fifty realizations were generated from each setting. 



4 Discussion on the simulation results 

In testing the robustness of the modified HINoV algorithm using simulated ordinal
or symbolic interval data, the major criterion was the identification of the noisy
variables. The HINoV-selected variables contain variables with the highest topri values.
In models 2-11 the number of nonnoisy variables is known. Due to this fact, in
the simulation study, the number of the HINoV-selected variables equals the number of
nonnoisy variables in each model. When the noisy variables were identified, the next
step was to run one of the clustering methods based on a distance matrix (pam, single,
complete, average, mcquitty, median, centroid, Ward) with the nonnoisy subset of
variables (HINoV-selected variables) and with all variables. Then each clustering result
was compared with the known cluster structure from models 2-11 using Hubert
and Arabie's (1985) corrected Rand index (see Tables 1 and 2).

Some conclusions can be drawn from the simulations results:



Table 1. Cluster recovery for all variables and HINoV-selected subsets of variables for ordinal
data (five categories) by experimental model and clustering method

           Clustering method
Model      pam       ward      single    complete  average   mcquitty  median    centroid
2     a    0.38047   0.53576   0.00022   0.11912   0.42288   0.25114   0.00527   0.00032
      b    0.84218   0.90705   0.72206   0.12010   0.99680   0.41796   0.30451   0.89835
3     a    0.27681   0.34071   0.00288   0.29392   0.40818   0.35435   0.04625   0.00192
      b    0.85946   0.60606   0.36121   0.61090   0.68223   0.51487   0.49199   0.61156
4     a    0.35609   0.44997   0.00127   0.43860   0.53509   0.47083   0.04677   0.00295
      b    0.83993   0.87224   0.56313   0.56541   0.80149   0.62102   0.54109   0.80156
5     a    0.54746   0.60139   0.27610   0.46735   0.58050   0.49842   0.33303   0.50178
      b    0.91071   0.84888   0.48550   0.73720   0.81317   0.79644   0.72899   0.74462
6     a    0.61074   0.60821   0.13400   0.53296   0.61037   0.56426   0.35113   0.47885
      b    0.83880   0.87183   0.56074   0.75584   0.86282   0.81395   0.71085   0.79018
7     a    0.10848   0.11946   0.00517   0.09267   0.10945   0.11883   0.00389   0.00659
      b    0.80072   0.87399   0.27965   0.87892   0.94882   0.77503   0.74141   0.91638
8     a    0.31419   0.43180   0.00026   0.29529   0.40203   0.36771   0.00974   0.00023
      b    0.95261   0.96372   0.58026   0.95596   0.96627   0.95507   0.93701   0.96582
9     a    0.37078   0.45915   0.01123   0.12128   0.50198   0.31134   0.04326   0.00709
      b    0.99966   0.98498   0.93077   0.96993   0.99626   0.98024   0.95461   0.99703
10    a    0.29727   0.41152   0.00020   0.22358   0.41107   0.34663   0.00030   0.00007
      b    1.00000   1.00000   0.99396   0.99911   1.00000   1.00000   0.99867   1.00000
2-10  b    0.89378   0.88097   0.60858   0.73259   0.89642   0.76384   0.71212   0.85838
      r    0.53130   0.44119   0.56066   0.44540   0.45403   0.39900   0.61883   0.74730
      ccr  98.22%    98.00%    94.44%    90.67%    97.11%    89.56%    98.89%    98.44%
11    a    0.04335   0.04394   0.00012   0.04388   0.03978   0.03106   0.00036   0.00009
      b    0.14320   0.08223   0.12471   0.08497   0.10373   0.12355   0.04626   0.06419

a (b) - values represent Hubert and Arabie's adjusted Rand indices averaged over fifty replications
for each model with all variables (with HINoV-selected variables); the rows labelled 2-10
give averages over models 2-10; r = b - a; ccr - corrected cluster recovery.



1. The cluster recovery that used only the HINoV-selected variables for ordinal
data (Table 1) and symbolic interval data (Table 2) was better than the one that used
all variables for all models 2-10 and each clustering method.

2. Among 450 simulated data sets (nine models with 50 runs) the HINoV method
was better (see ccr in Tables 1 and 2):

- from 89.56% (mcquitty) to 98.89% (median) of runs for ordinal data,

- from 91.78% (ward) to 99.78% (centroid) of runs for symbolic interval data.

3. Figure 1 shows the relationship between the values of adjusted Rand indices 
averaged over fifty replications and models 2-10 with the HINoV-selected variables 
(b) and values showing an improvement (r) of average adjusted Rand indices (cluster 
recovery with the HINoV selected variables against all variables) separately for eight 
clustering methods and types of data (ordinal, symbolic interval). Based on adjusted
Rand indices averaged over fifty replications and models 2-10 the improvements in
cluster recovery (HINoV selected variables against all variables) are varying:

- for ordinal data from 0.3990 (mcquitty) to 0.7473 (centroid),

- for symbolic interval data from 0.3473 (ward) to 0.9474 (centroid).



Table 2. Cluster recovery for all variables and HINoV-selected subsets of variables for symbolic
interval data by experimental model and clustering method

           Clustering method
Model      pam       ward      single    complete  average   mcquitty  median    centroid
2     a    0.86670   0.87920   0.08006   0.28578   0.32479   0.49424   0.02107   0.00004
      b    0.99920   0.97987   0.91681   0.99680   0.99524   0.98039   0.85840   0.95739
3     a    0.41934   0.39743   0.00368   0.37361   0.38831   0.36597   0.00088   0.00476
      b    1.00000   1.00000   1.00000   1.00000   1.00000   1.00000   0.99062   1.00000
4     a    0.04896   0.01641   0.00269   0.01653   -0.00075  0.01009   0.00177   0.00023
      b    1.00000   1.00000   1.00000   1.00000   1.00000   1.00000   1.00000   1.00000
5     a    0.71543   0.70144   0.73792   0.47491   0.60960   0.53842   0.34231   0.28338
      b    0.99556   0.99718   0.98270   0.91522   0.99478   0.99210   0.90252   0.97237
6     a    0.75308   0.67237   0.33392   0.47230   0.67817   0.55727   0.18194   0.10131
      b    0.99631   0.99764   0.99169   0.95100   0.98809   0.97881   0.84463   0.99866
7     a    0.36466   0.51262   0.00992   0.32856   0.33905   0.39823   0.00527   0.00681
      b    1.00000   0.99974   1.00000   0.98493   0.99954   1.00000   0.99974   0.99954
8     a    0.74711   0.85104   0.01675   0.50459   0.51029   0.61615   0.00056   0.00023
      b    1.00000   0.99966   0.99932   0.99966   0.99966   0.99843   0.99835   1.00000
9     a    0.86040   0.90306   0.30121   0.26791   0.54639   0.62620   0.00245   0.00419
      b    1.00000   1.00000   1.00000   1.00000   1.00000   1.00000   1.00000   1.00000
10    a    0.70324   0.91460   0.00941   0.48929   0.47886   0.54275   0.00007   0.00004
      b    1.00000   1.00000   1.00000   1.00000   1.00000   1.00000   1.00000   1.00000
2-10  b    0.99900   0.99712   0.98783   0.98306   0.99747   0.99441   0.95491   0.99199
      r    0.39023   0.34732   0.82166   0.62601   0.56687   0.53337   0.89310   0.94744
      ccr  94.67%    91.78%    97.33%    99.11%    96.22%    96.44%    99.56%    99.78%
11    a    0.05334   0.04188   0.00007   0.03389   0.02904   0.03313   0.00009   0.00004
      b    0.12282   0.04339   0.04590   0.08259   0.08427   0.14440   0.04380   0.08438

a (b); r = b - a; ccr - see Table 1. The rows labelled 2-10 give averages over models 2-10.



5 Conclusions 

The HINoV algorithm has limitations for analyzing nonmetric and symbolic interval 
data almost the same as the ones mentioned in Carmone et al. (1999) article for 
metric data. 

First, the HINoV method is of little use with a nonmetric data set or a symbolic data
array in which all variables are noisy (no cluster structure - see model 1). In this 
situation topri values are similar and close to zero (see Table 3).



Fig. 1. The relationship between values of b and r
Source: own research



Table 3. Mean and standard deviation of topri values for 10 variables in model 1

          Ordinal data with five categories   Symbolic data array
Variable  mean        sd                      mean        sd
1         -0.00393    0.01627                 0.00080     0.02090
2         -0.00175    0.01736                 0.00322     0.02154
3         0.00082     0.02009                 0.00179     0.01740
4         -0.00115    0.01890                 -0.00206    0.02243
5         0.00214     0.02297                 -0.00025    0.02074
6         0.00690     0.02030                 -0.00312    0.02108
7         -0.00002    0.02253                 -0.00440    0.02044
8         0.00106     0.01754                 0.00359     0.01994
9         0.00442     0.01998                 0.00394     0.02617
10        -0.00363    0.01959                 0.00023     0.02152


Second, the HINoV method depends on the relationship between pairs of vari- 
ables. If we have only one variable with a cluster structure and the others are noisy, 
the HINoV will not be able to isolate this nonnoisy variable (see Table 4). 

Third, if all variables have the same cluster structure (no noisy variables) the topri 
values will be large and similar for all variables. The suggested selection process 
using a scree diagram will be ineffective. 

Fourth, an important problem is to decide on a proper number of clusters in stage 
two of the HINoV algorithm with symbolic interval data. To resolve this problem we 
should initiate the HINoV algorithm with a different number of clusters.



Table 4. Mean and standard deviation of topri values for 10 variables in model 11

          Ordinal data with five categories   Symbolic data array
Variable  mean        sd                      mean        sd
1         -0.00095    0.03050                 0.00012     0.02961
2         -0.00198    0.02891                 0.00070     0.03243
3         0.00078     0.02937                 -0.00206    0.02969
4         -0.00155    0.02950                 -0.00070    0.03185
5         0.00056     0.02997                 -0.00152    0.03157
6         0.00148     0.03090                 -0.00114    0.03064
7         -0.00246    0.02959                 -0.00203    0.03019
8         -0.00274    0.03137                 -0.00186    0.03021
9         -0.00099    0.02975                 0.00088     0.03270
10        0.00023     0.02809                 -0.00181    0.03126



Abstract. A conceptual framework for cluster analysis from the viewpoint of p-adic geometry
is introduced by describing the space of all dendrograms for n datapoints and relating
it to the moduli space of p-adic Riemannian spheres with punctures using a method recently 
applied by Murtagh (2004b). This method embeds a dendrogram as a subtree into the Bruhat- 
Tits tree associated to the p-adic numbers, and goes back to Cornelissen et al. (2001) in p-adic 
geometry. After explaining the definitions, the concept of classifiers is discussed in the con- 
text of moduli spaces, and upper bounds for the number of hidden vertices in dendrograms are 
given. 



1 Introduction 

Dendrograms are ultrametric spaces, and ultrametricity is a pervasive property of
observational data. By Murtagh (2004a), this offers computational advantages
and a well understood basis for developing data processing tools originating in p-adic
arithmetic. The aim of this article is to show that the foundations can be laid
much deeper by taking into account a natural object in p-adic geometry, namely the
Bruhat-Tits tree. This locally finite, regular tree naturally contains the dendrograms
as subtrees which are uniquely determined by assigning p-adic numbers to data.
Hence, the classification task is conceptually reduced to finding a suitable p-adic
data encoding. Dragovich and Dragovich (2006) find a 5-adic encoding of DNA
sequences, and Bradley (2007) shows that strings have natural p-adic encodings.

The geometric approach makes it possible to treat time-dependent data on an
equal footing with data that relate only to one instant of time by providing the concept
of family of dendrograms. Probability distributions on families are then seen as a 
convenient way of describing classifiers. 

Our illustrative toy data set for this article is given as follows: 

Example 1.1 Consider the data set $D = \{0, 1, 3, 4, 12, 20, 32, 64\}$ given by $n = 8$
natural numbers. We want to hierarchically classify it with respect to the 2-adic
norm $|\cdot|_2$ as our distance function, as defined in Section 2.



2 A brief introduction to p-adic geometry

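Euclidean geometry is modelled on the field $\mathbb{R}$ of real numbers which are often represented
as decimals, i.e. expanded in powers of the number $10^{-1}$:

$x = \pm \sum_{\nu=m}^{\infty} a_\nu 10^{-\nu}$, $a_\nu \in \{0, \dots, 9\}$, $m \in \mathbb{Z}$.

In this way, $\mathbb{R}$ completes the field $\mathbb{Q}$ of rational numbers with respect to the absolute
norm

$|x| = \begin{cases} x, & x \ge 0 \\ -x, & x < 0 \end{cases}$.

On the other hand, the p-adic norm on $\mathbb{Q}$,

$|x|_p = \begin{cases} p^{-\nu_p(x)}, & x \neq 0 \\ 0, & x = 0 \end{cases}$

is defined for $x = a_1/a_2$ by the difference $\nu_p(x) = \nu_p(a_1) - \nu_p(a_2) \in \mathbb{Z}$ in the multiplicities
with which numerator and denominator of $x$ are divisible by the prime number
$p$: $a_i = p^{\nu_p(a_i)} u_i$, and $u_i$ not divisible by $p$, $i = 1, 2$.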

The p-adic norm satisfies the ultrametric triangle inequality

$|x + y|_p \le \max\{|x|_p, |y|_p\}$.

Completing $\mathbb{Q}$ with respect to the p-adic norm yields the field $\mathbb{Q}_p$ of p-adic numbers
which is well known to consist of the power series

$x = \sum_{\nu=m}^{\infty} a_\nu p^{\nu}$, $a_\nu \in \{0, \dots, p-1\}$, $m \in \mathbb{Z}$. (1)

Note that the p-adic expansion is in increasing powers of $p$, whereas in the decimal
expansion, it is the powers of $10^{-1}$ which increase arbitrarily. An introduction to
p-adic numbers is e.g. Gouvea (2003).

Example 2.1 For our toy data set $D$, we have $|0|_2 = 0$, $|1|_2 = |3|_2 = 1$, $|4|_2 = |12|_2 =
|20|_2 = 2^{-2}$, $|32|_2 = 2^{-5}$, $|64|_2 = 2^{-6}$, i.e. $|\cdot|_2$ is maximally 1 on $D$. Other examples:
$|3/2|_3 = |6/4|_3 = 3^{-1}$, $|20|_5 = 5^{-1}$, $|p^{-1}|_p = |p|_p^{-1} = p$.
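The values in Example 2.1 can be verified mechanically. The sketch below computes the p-adic valuation and norm of a rational number with exact arithmetic; the function names are illustrative and not taken from any of the cited works.

```python
from fractions import Fraction


def valuation(x, p):
    """p-adic valuation of a nonzero rational: multiplicity of p in numerator minus denominator."""
    x = Fraction(x)
    if x == 0:
        raise ValueError("the valuation of 0 is +infinity")

    def multiplicity(n):
        n, count = abs(n), 0
        while n % p == 0:
            n //= p
            count += 1
        return count

    return multiplicity(x.numerator) - multiplicity(x.denominator)


def padic_norm(x, p):
    """|x|_p = p^(-valuation) for x != 0, and 0 for x = 0, as an exact Fraction."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    return Fraction(p) ** (-valuation(x, p))


D = [0, 1, 3, 4, 12, 20, 32, 64]
print([str(padic_norm(x, 2)) for x in D])
print(padic_norm(Fraction(3, 2), 3), padic_norm(20, 5), padic_norm(Fraction(1, 7), 7))
```

With `padic_norm(x - y, 2)` as the distance between data points x and y, the same two functions give the 2-adic distances on the toy data set.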

Consider the unit disk $\mathbb{D} = \{x \in \mathbb{Q}_p \mid |x|_p \le 1\} = B_1(0)$. It consists of the 
so-called p-adic integers, and is often denoted as $\mathbb{Z}_p$ when emphasizing its ring 
structure, i.e. closedness under addition, subtraction and multiplication. A p-adic 
number x lies in an arbitrary closed disk $B_{p^{-r}}(a) = \{x \in \mathbb{Q}_p \mid |x-a|_p \le p^{-r}\}$, 
where $r \in \mathbb{Z}$, if and only if x - a is divisible by $p^r$. This condition is equivalent 
to x and a having the first r terms in common in their p-adic expansions (1). The 
possible radii are all integer powers of p, so the disjoint disks 
$B_{p^{-1}}(0), B_{p^{-1}}(1), \dots, B_{p^{-1}}(p-1)$ are the maximal proper subdisks of $\mathbb{D}$, as they 
correspond to truncating the power series (1) after the constant term. There is a 
unique minimal disk in which $\mathbb{D}$ is contained properly, namely 
$B_p(0) = \{x \in \mathbb{Q}_p \mid |x|_p \le p\}$. These observations hold true for arbitrary p-adic 
disks, i.e. any disk $B_{p^{-r}}(x)$, $x \in \mathbb{Q}_p$, is partitioned into precisely p maximal 
subdisks and lies properly in a unique minimal disk. Therefore, if we define a graph 
$\mathcal{T}_{\mathbb{Q}_p}$ whose vertices are the p-adic disks, and whose edges are given by minimal 
inclusion, then every vertex of $\mathcal{T}_{\mathbb{Q}_p}$ has precisely p + 1 outgoing edges. In other 
words, $\mathcal{T}_{\mathbb{Q}_p}$ is a (p+1)-regular tree, and p is the size of the residue field 
$\mathbb{F}_p = \mathbb{Z}_p / p\mathbb{Z}_p$. 
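
For ordinary integers and r >= 0, the disk condition above reduces to a congruence modulo $p^r$, so the partition of a disk into p maximal subdisks can be illustrated in a few lines. This is a sketch under that restriction only (integers instead of general p-adic numbers); `in_disk` and `maximal_subdisks` are hypothetical helper names.

```python
def in_disk(x, a, r, p):
    """True if the integer x lies in the closed disk B_{p^-r}(a), for r >= 0.

    For integers this is the congruence x = a (mod p^r).
    """
    return (x - a) % p**r == 0

def maximal_subdisks(a, r, p):
    """Centres of the p maximal proper subdisks (radius p^-(r+1)) of B_{p^-r}(a).

    Each centre is reduced modulo p^(r+1), which does not change the disk.
    """
    return [(a + j * p**r) % p**(r + 1) for j in range(p)]

D = [0, 1, 3, 4, 12, 20, 32, 64]
# Points of D in B_{2^-2}(0), split among its two maximal subdisks.
cluster = [x for x in D if in_disk(x, 0, 2, 2)]
split = {c: [x for x in cluster if in_disk(x, c, 3, 2)]
         for c in maximal_subdisks(0, 2, 2)}
```

On the toy data set, `split` separates {0, 32, 64} from {4, 12, 20}, which is the first branching below the even numbers in the 2-adic dendrogram.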

Definition 2.2 The tree $\mathcal{T}_{\mathbb{Q}_p}$ is called the Bruhat-Tits tree for $\mathbb{Q}_p$. 

Remark 2.3 Definition 2.2 is not the usual way to define $\mathcal{T}_{\mathbb{Q}_p}$. The problem with 
this ad-hoc definition is that it does not allow for any action of the projective linear 
group $\mathrm{PGL}_2(\mathbb{Q}_p)$. A definition invariant under projective linear transformations can 
be found e.g. in Herrlich (1980) or Bradley (2006). 

An important observation is that any infinite descending chain 

$$B_1 \supset B_2 \supset \dots \qquad (2)$$

of strictly decreasing p-adic disks converges to a unique p-adic number 
$\{x\} = \bigcap_n B_n$. 

A chain (2) defines a halfline in the Bruhat-Tits tree $\mathcal{T}_{\mathbb{Q}_p}$. Halflines differing only 
by finitely many vertices are said to be equivalent, and the equivalence classes under 
this equivalence relation are called ends. Hence the observation means that the p-adic 
numbers correspond to ends of $\mathcal{T}_{\mathbb{Q}_p}$. There is a unique end $B_1 \subset B_2 \subset \dots$ coming 
from any strictly increasing sequence of disks. This end corresponds to the point at 
infinity in the p-adic projective line $\mathbb{P}^1(\mathbb{Q}_p) = \mathbb{Q}_p \cup \{\infty\}$, whence the well known 
fact: 

Lemma 2.4 The ends of $\mathcal{T}_{\mathbb{Q}_p}$ are in one-to-one correspondence with the 
$\mathbb{Q}_p$-rational points of the p-adic projective line $\mathbb{P}^1$, i.e. with the elements of 
$\mathbb{P}^1(\mathbb{Q}_p)$. 

From the viewpoint of geometry, it is important to distinguish between the p-adic 
projective line $\mathbb{P}^1$ as a p-adic manifold and its set $\mathbb{P}^1(\mathbb{Q}_p)$ of $\mathbb{Q}_p$-rational points, 
in the same way as one distinguishes between the affine real line $\mathbb{A}^1$ as a real 
manifold and its rational points $\mathbb{A}^1(\mathbb{Q}) = \mathbb{Q}$, for example. One reason for 
distinguishing between a space and its points is: 

Lemma 2.5 Endowed with the metric topology from $|\cdot|_p$, the topological space $\mathbb{Q}_p$ 
is totally disconnected. 

The usual approaches towards defining more useful topologies on p-adic spaces 
are by introducing more points. Such an approach is the Berkovich topology, which 
we will very briefly describe. More details can be found in Berkovich (1990). 

The idea is to allow disks whose radii are arbitrary positive real numbers, not 
merely powers of p as before. Any strictly descending chain of such disks gives a 
point in the sense of Berkovich. For the p-adic line $\mathbb{P}^1$ this amounts to: 

Theorem 2.6 (Berkovich) $\mathbb{P}^1$ is non-empty, compact, Hausdorff and arc-wise 
connected. Every point of $\mathbb{P}^1 \setminus \{\infty\}$ corresponds to a descending sequence 
$B_1 \supseteq B_2 \supseteq \dots$ of p-adic disks such that $B = \bigcap_n B_n$ is one of the following: 

1. a point x in $\mathbb{Q}_p$, 

2. a closed p-adic disk with radius $r \in |\mathbb{Q}_p|_p$, 

3. a closed p-adic disk with radius $r \notin |\mathbb{Q}_p|_p$, 

4. empty. 

Points of types 2. to 4. are called generic, points of type 1. classical. We remark 
that Berkovich's definition of points is technically somewhat different and allows 
one to define more general p-adic spaces. Finally, the Bruhat-Tits tree $\mathcal{T}_{\mathbb{Q}_p}$ is 
recovered inside $\mathbb{P}^1$: 

Theorem 2.7 (Berkovich) $\mathcal{T}_{\mathbb{Q}_p}$ is a retract of $\mathbb{P}^1 \setminus \mathbb{P}^1(\mathbb{Q}_p)$, i.e. there is a map 
$\mathbb{P}^1 \setminus \mathbb{P}^1(\mathbb{Q}_p) \to \mathcal{T}_{\mathbb{Q}_p}$ whose restriction to $\mathcal{T}_{\mathbb{Q}_p}$ is the identity map on $\mathcal{T}_{\mathbb{Q}_p}$. 

3 p-adic dendrograms 

Fig. 1. 2-adic valuations for D. 

Fig. 2. 2-adic dendrogram for $D \cup \{\infty\}$. 

Example 3.1 The 2-adic distances within D are encoded in Figure 1, where 
$\mathrm{dist}(i,j) = 2^{-v_2(i,j)}$ if $v_2(i,j)$ is the corresponding entry in Figure 1, using 
$2^{-\infty} = 0$. Figure 2 is the dendrogram for D using $|\cdot|_2$: the distance between 
disjoint clusters equals the distance between any of their representatives. 
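
Since Figure 1 itself is not reproduced here, the table of valuations can be recomputed from the data. The sketch below assumes that the entry for the pair (i, j) is the 2-adic valuation of the difference i - j; it also checks the ultrametric inequality on all triples of D. The names `v2`, `dist` and `valuations` are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def v2(n):
    """2-adic valuation of an integer; v2(0) is +infinity."""
    if n == 0:
        return float("inf")
    n, v = abs(n), 0
    while n % 2 == 0:
        n //= 2
        v += 1
    return v

def dist(i, j):
    """2-adic distance |i - j|_2 = 2^(-v2(i - j)), with 2^(-inf) = 0."""
    v = v2(i - j)
    return Fraction(0) if v == float("inf") else Fraction(1, 2 ** v)

D = [0, 1, 3, 4, 12, 20, 32, 64]
# Matrix of valuations, rows and columns ordered as in D.
valuations = [[v2(i - j) for j in D] for i in D]
# Strong triangle inequality on every ordered triple of distinct points.
is_ultrametric = all(
    dist(x, z) <= max(dist(x, y), dist(y, z)) for x, y, z in permutations(D, 3)
)
```

Because every triangle is isosceles with two equal longest sides, the distance between two disjoint clusters does not depend on the chosen representatives.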

Let $X \subset \mathbb{P}^1(\mathbb{Q}_p)$ be a finite set. By Lemma 2.4, a point of X can be considered as 
an end in $\mathcal{T}_{\mathbb{Q}_p}$. 

Definition 3.2 The smallest subtree $\mathscr{D}(X)$ of $\mathcal{T}_{\mathbb{Q}_p}$ whose ends are given by X is 
called the p-adic dendrogram for X. 

Cornelissen et al. (2001) use p-adic dendrograms for studying p-adic symmetries, 
cf. also Cornelissen and Kato (2005). We will ignore vertices in $\mathscr{D}(X)$ from 
which precisely two edges emanate. Hence, for example, $\mathscr{D}(\{0,1,\infty\})$ consists of 
a unique vertex $v(0,1,\infty)$ and three ends. The dendrogram for a set $X \subset \mathbb{N} \cup \{\infty\}$ 
containing $\{0,1,\infty\}$ is a rooted tree with root $v(0,1,\infty)$. 

Example 3.3 The 2-adic dendrogram in Figure 2 is nothing but $\mathscr{D}(X)$ for 
$X = D \cup \{\infty\}$ and is in fact inspired by the first dendrogram of Murtagh (2004b). The 
path from the top cluster to $x_i$ yields its binary representation $[\cdot]_2$ which easily 
translates into the 2-adic expansion: $0 = [0000000]_2$, $64 = [1000000]_2 = 2^6$, 
$32 = [0100000]_2 = 2^5$, $4 = [0000100]_2 = 2^2$, $20 = [0010100]_2 = 2^2 + 2^4$, 
$12 = [0001100]_2 = 2^2 + 2^3$, $1 = [0000001]_2$, $3 = [0000011]_2 = 1 + 2^1$. 
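
One way to obtain such a dendrogram computationally is to split the data recursively by residues modulo increasing powers of p, skipping levels at which no branching occurs (these are the vertices with precisely two edges that are ignored above). The following is a sketch for distinct non-negative integers only; the nested-tuple output format is an assumption of this example, and the end at infinity is left implicit above the root.

```python
def dendrogram(points, p=2, r=0):
    """Nested-tuple p-adic dendrogram of distinct non-negative integers.

    A leaf is the integer itself. An inner vertex is a pair (r, children):
    the points below it agree modulo p**r but not modulo p**(r + 1), so
    children of the same vertex are at mutual distance p**-r.
    """
    points = sorted(set(points))
    if len(points) == 1:
        return points[0]
    classes = {}
    for x in points:
        classes.setdefault(x % p ** (r + 1), []).append(x)
    if len(classes) == 1:
        # No branching at this level: skip the vertex of degree two.
        return dendrogram(points, p, r + 1)
    return (r, [dendrogram(c, p, r + 1) for _, c in sorted(classes.items())])

D = [0, 1, 3, 4, 12, 20, 32, 64]
tree = dendrogram(D, p=2)
```

For the toy data set the root splits the even numbers from {1, 3} at level 0, and the even numbers split into {0, 32, 64} and {4, 12, 20} at level 2.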

Any encoding of some data set M which assigns to each $x \in M$ a p-adic 
representation of an integer including 0 and 1, yields a p-adic dendrogram 
$\mathscr{D}(M \cup \{\infty\})$ whose root is $v(0,1,\infty)$, and any dendrogram for real data can be 
embedded in a non-unique way into $\mathcal{T}_{\mathbb{Q}_p}$ as a p-adic dendrogram in such a way that 
$v(0,1,\infty)$ represents the top cluster, if p is large enough. In particular, any binary 
dendrogram is a 2-adic dendrogram. However, a little algebra helps to find 
sufficiently large 2-adic Bruhat-Tits trees $\mathcal{T}_K$ which allow embeddings of arbitrary 
dendrograms into $\mathcal{T}_K$. In fact, by K we mean a finite extension field of $\mathbb{Q}_p$. The 
p-adic norm $|\cdot|_p$ extends uniquely to a norm $|\cdot|_K$ on K, for which it is a complete 
field, called a p-adic number field. The integers of K are again the unit disk 
$O_K = \{x \in K \mid |x|_K \le 1\}$, and the role of the prime p is played by a so-called 
uniformiser $\pi \in O_K$. It has the property that $O_K / \pi O_K$ is a finite field with 
$q = p^f$ elements and contains $\mathbb{F}_p$. Hence, if some dendrogram has a vertex with 
maximally $n \ge 2$ children, then we need K large enough such that $2^f \ge n$. This is 
possible by the results of number theory. Restricting to the prime characteristic 2 
not only avoids the need to switch the prime number p in the case of more than p 
children vertices; the arithmetic in 2-adic number fields is also known to be 
computationally simpler, especially as in our case the so-called unramified 
extensions, i.e. those where $\dim_{\mathbb{Q}_p} K = f$, are sufficient. 

Example 3.4 According to Bradley (2007), strings over a finite alphabet can be 
encoded in an unramified extension of $\mathbb{Q}_p$, and hence be classified p-adically. 



4 The space of dendrograms 

From now on, we will formulate everything for the case $K = \mathbb{Q}_p$, bearing in mind 
that all results hold true for general p-adic number fields K. Let 
$S = \{x_1,\dots,x_n\} \subset \mathbb{P}^1(\mathbb{Q}_p)$ consist of n distinct classical points of $\mathbb{P}^1$ such that 
$x_1 = 0$, $x_2 = 1$, $x_3 = \infty$. Similarly as in Theorem 2.7, the p-adic dendrogram $\mathscr{D}(S)$ 
is a retract of the marked projective line $X = \mathbb{P}^1 \setminus S$. We call $\mathscr{D}(S)$ the skeleton 
of X. The space of all projective lines with n such markings is denoted by $\mathfrak{M}_n$, and 
the space of corresponding p-adic dendrograms by $\mathfrak{D}_{n-1}$. $\mathfrak{M}_n$ is a p-adic space of 
dimension n - 3, its skeleton $\mathfrak{D}_{n-1}$ is a cw-complex of real polyhedra whose cells 
of maximal dimension n - 3 consist of the binary dendrograms. Neighbouring cells 
are passed through by contracting bounded edges as the n - 3 "free" markings 
"move" about $\mathbb{P}^1$ without colliding. For example, $\mathfrak{M}_3$ is just a point corresponding 
to $\mathbb{P}^1 \setminus \{0,1,\infty\}$. $\mathfrak{M}_4$ has one free marking $\lambda$ which can be any $\mathbb{Q}_p$-rational point 
from $\mathbb{P}^1 \setminus \{0,1,\infty\}$. Hence, the skeleton $\mathfrak{D}_3$ is itself a binary dendrogram with 
precisely one vertex v and three unbounded edges A, B, C (cf. Figure 3). For 
$n \ge 3$ there are maps 

$$f_{n+1} : \mathfrak{M}_{n+1} \to \mathfrak{M}_n, \qquad \varphi_{n+1} : \mathfrak{D}_n \to \mathfrak{D}_{n-1},$$

which forget the (n+1)-st marking. 

Fig. 3. Dendrograms representing the different regions of $\mathfrak{D}_3$. 

Consider a $\mathbb{Q}_p$-rational point $x \in \mathfrak{M}_n$, corresponding to $\mathbb{P}^1 \setminus S$ with skeleton d. 
Its fibre $f_{n+1}^{-1}(x)$ corresponds to $\mathbb{P}^1 \setminus S'$ for all possible S' whose first n entries 
constitute S. Hence, the extra marking $\lambda \in S' \setminus S$ can be taken arbitrarily from 
$\mathbb{P}^1(\mathbb{Q}_p) \setminus S$. In this way, the space $f_{n+1}^{-1}(x)$ can be considered as $\mathbb{P}^1 \setminus S$, and 
$\varphi_{n+1}^{-1}(d)$ as the p-adic dendrogram for S. What we have seen is that taking fibres 
recovers the dendrograms corresponding to points in the space $\mathfrak{D}_{n-1}$. Instead 
of fibres of points, one can take fibres of arbitrary subspaces: 

Definition 4.1 A family of dendrograms with n data points over a space Y is a map 
$Y \to \mathfrak{M}_n$ from some p-adic space Y to $\mathfrak{M}_n$. 

For example, take $Y = \{y_1,\dots,y_T\}$. Then a family $Y \to \mathfrak{M}_n$ is a time series of 
n collision-free particles, if $t \in \{1,\dots,T\}$ is interpreted as time variable. It is also 
possible to take into account colliding particles by using compactifications of $\mathfrak{M}_n$ as 
described in Bradley (2006). 



5 Distributions on dendrograms 

Given a dendrogram $\mathscr{D}$ for some data $S = \{x_1,\dots,x_n\}$, the idea of a classifier is 
to incorporate a further datum $x \notin S$ into the classification scheme represented by 
$\mathscr{D}$. Often this is done by assigning probabilities to the vertices of $\mathscr{D}$ depending 
on x. The result is then a family of possible dendrograms for $S \cup \{x\}$ with a certain 
probability distribution. It is clear that, in the case of p-adic dendrograms, this family 
is nothing but $\varphi_{n+1}^{-1}(d) \to \mathfrak{D}_n$, if $d \in \mathfrak{D}_{n-1}$ is the point representing $\mathscr{D}$. This 
motivates the following definition: 

Definition 5.1 A universal p-adic classifier C for n given points is a probability 
distribution on $\mathfrak{M}_{n+1}$. 

Here, we take on $\mathfrak{M}_{n+1}$ the Borel $\sigma$-algebra associated to the open sets of the 
Berkovich topology. If $x \in \mathfrak{M}_n$ corresponds to $\mathbb{P}^1 \setminus S$, then C induces a distribution 
on $f_{n+1}^{-1}(x)$, hence (after renormalisation) a probability distribution on $\varphi_{n+1}^{-1}(d)$, 
where $d \in \mathfrak{D}_{n-1}$ is the point corresponding to the dendrogram $\mathscr{D}(S)$. The same 
holds true for general families of dendrograms, e.g. time series of particles. 



6 Hidden vertices 



A vertex v in a p-adic dendrogram $\mathscr{D}$ is called hidden, if the class corresponding 
to v is not the top class and does not directly contain data points but is composed 
of non-trivial subclasses. The subforest of $\mathscr{D}$ spanned by its hidden vertices will be 
denoted by $\mathscr{D}^h$ and is called the hidden part of $\mathscr{D}$. The number $b_0^h$ of connected 
components of $\mathscr{D}^h$ measures how the clusters corresponding to non-hidden vertices 
are spread within the dendrogram $\mathscr{D}$. We give bounds for $b_0^h$ and the number $v^h$ of 
hidden vertices, and refer to Bradley (2006) for the combinatorial proofs (Theorems 
8.3 and 8.5). 



Theorem 6.1 Let ‘3) G Then 



v"< 



n + 2 — bn 



and 



^o< 



n — A 
3 ’ 



where the latter bound is sharp. 



7 Conclusions 

Since ultrametricity is the natural property which allows classification and is 
pervasive in observational data, the techniques of ultrametric analysis and p-adic 
geometry are at one's disposal for identifying and exploiting ultrametricity. A p-adic 
encoding of data provides a way to investigate arithmetic properties of the p-adic 
numbers representing the data. 

It is our aim to lay the geometric foundation towards p-adic data encoding. From 
the geometric point of view it is natural to perform the encoding by embedding its 
underlying dendrogram into the Bruhat-Tits tree. In fact, the dendrogram and its 
embedding are uniquely determined by the p-adic numbers representing the data. To 
this end, we give an account of p-adic geometry in order to define p-adic 
dendrograms as subtrees of the Bruhat-Tits tree. 

In the next step we introduce the space of all dendrograms for a given number 
of data points which, by p-adic geometry, is contained in the space $\mathfrak{M}_n$ of all 
marked projective lines, an object appearing in the context of the classification of 
Riemann surfaces. The advantages of considering the space of dendrograms are that 
it makes possible a conceptual formulation of moving particles as families of 
dendrograms, and that it has a simple geometry as a polyhedral complex. Also, 
assigning distributions on $\mathfrak{M}_n$ allows for probabilistic incorporation of further data 
into a given dendrogram. At the end, we give bounds for the numbers of hidden 
vertices and hidden components of dendrograms. 

What remains to be done is to computationally exploit the foundations laid in this 
article by developing code along these lines and applying it to Fionn Murtagh's task 
of finding ultrametricity in data. 

Abstract. Forward search (FS) methods have been shown to be usefully employed for 
detecting multiple outliers in continuous multivariate data (Hadi, (1994); Atkinson et al., 
(2004)). Starting from an outlier-free subset of observations, they iteratively enlarge this 
good subset using Mahalanobis distances based only on the good observations. In this paper, 
an alternative formulation of the FS paradigm is presented, that takes a mixture of K > 1 
normal components as a null model. The proposal is developed according to both the graphical 
and the inferential approach to FS-based outlier detection. The performance of the method is 
shown on an illustrative example and evaluated on a simulation experiment in the multiple 
cluster setting. 



1 Introduction 

Mixtures of multivariate normal densities are widely used in cluster analysis, density 
estimation and discriminant analysis, usually resorting to maximum likelihood (ML) 
estimation, via the EM algorithm (for an overview, see McLachlan and Peel, (2000)). 
When the number of components K is treated as fixed, ML estimation is not robust 
against outlying data: a single extreme point can make the parameter estimation of 
at least one of the mixture components break down. Among the solutions presented 
in the literature, the main computable approaches in the multivariate setting are: the 
addition of a noise component modelled as a uniform distribution on the convex hull 
of the data, implemented in the software MCLUST (Fraley and Raftery, (1998)); a 
mixture of t-distributions instead of normal distributions, implemented in the software 
EMMIX (McLachlan and Peel, (2000)). According to Hennig, both alternatives 
"... do not possess a substantially better breakdown behavior than estimation based on 
normal mixtures" (Hennig, (2004)). 

An alternative approach to the problem is based on the idea that a good outlier 
detection method defines a robust estimation method, that works by omitting the 
observations nominated as outliers and computing a standard non-robust estimate 
on the remaining observations. Here, attention is focussed on the so-called forward 
search (FS) methods, which have been usefully employed for detecting multiple 
outliers in continuous multivariate data. These methods are based on the assumption 
that non-outlying data stem from a multivariate normal distribution or are roughly 
elliptically symmetric. 

In this paper, an alternative formulation of the FS algorithm is proposed, which is 
specifically designed for situations where non-outlying data stem from a mixture of 
a known number of normal components. It could not only enlarge the applicability 
of FS outlier detection methods, but could also provide a possible strategy for robust 
fitting in multivariate normal mixture models. 



2 The Forward Search 

The Forward search (FS) is a powerful general method for detecting multiple masked 
outliers in continuous multivariate data (Hadi, (1994); Atkinson, (1993)). The search 
starts by fitting the multivariate normal model to a small subset $S_m$, consisting of 
$m = m_0$ observations, that can be safely presumed to be free of outliers: it can be 
specified by the data analyst or obtained by an algorithm. All n observations are 
ordered by their Mahalanobis distance and $S_m$ is updated as the set of the m + 1 
observations with the smallest Mahalanobis distances. Then, the number m is increased 
by 1 and the search goes on, by fitting the normal model to the current subset $S_m$ and 
updating $S_m$ as stated above - so that its size is increased by one unit at a time - 
until $S_m$ includes all n observations (that is, m = n). 

By ordering the data according to their closeness to the fitted model (by means 
of Mahalanobis distance), the various steps of the search provide subsets which are 
designed to be outlier-free, until there remain only outliers to be included. The 
inclusion of outlying observations can be signalled by following two main approaches. 
The former consists in graphically monitoring the values of suitable statistics during 
the search, such as the minimum squared Mahalanobis distance amongst units not 
included in subset $S_m$ (for m ranging from $m_0$ to n): if it is large, it means that an 
outlier is going to join the subset (for a presentation of FS exploratory techniques, 
see Atkinson et al., (2004)). The latter approach consists in testing the maximum 
squared Mahalanobis distance amongst the observations included in $S_m$: if it exceeds 
a given cutoff, then the search stops (before its natural ending) and the tested 
observation is nominated as an outlier together with all observations not yet included 
in $S_m$ (see Hadi, (1994), for a presentation of the method). 
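
The search loop just described can be sketched in a few lines. The code below is an illustration only, not the authors' implementation: it is restricted to bivariate data so that the covariance matrix can be inverted by hand, the starting subset is simply passed in, and the toy data (a grid plus three distant points) are invented for the example. It records the monitored statistic of the graphical approach, i.e. the minimum squared Mahalanobis distance among units outside the current subset.

```python
def mean_cov(pts):
    """Sample mean and unbiased covariance (sxx, sxy, syy) of 2-D points."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in pts) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in pts) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis2(pt, mean, cov):
    """Squared Mahalanobis distance, using the explicit 2x2 inverse."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    dx, dy = pt[0] - mx, pt[1] - my
    det = sxx * syy - sxy ** 2
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def forward_search(data, start):
    """Run the search from the index list `start` until all units are included.

    Returns (trace, subset): trace holds pairs (m, minimum squared distance
    among units not in S_m); subset is the final index list.
    """
    subset, n, trace = list(start), len(data), []
    while len(subset) < n:
        mean, cov = mean_cov([data[i] for i in subset])
        d2 = [mahalanobis2(pt, mean, cov) for pt in data]
        inside = set(subset)
        trace.append((len(subset), min(d2[i] for i in range(n) if i not in inside)))
        # New subset: the m + 1 units closest to the fitted model.
        subset = sorted(range(n), key=d2.__getitem__)[:len(subset) + 1]
    return trace, subset

# Hypothetical toy data: a 7 x 7 grid of "good" points and three far-away points.
clean = [(float(i), float(j)) for i in range(-3, 4) for j in range(-3, 4)]
data = clean + [(100.0, 100.0), (101.0, 100.0), (100.0, 101.0)]
start = [k for k, (x, y) in enumerate(data) if abs(x) <= 1 and abs(y) <= 1]
trace, final = forward_search(data, start)
```

Plotting the second component of `trace` against m gives the kind of forward plot used in the graphical approach; with data like these one would expect the monitored distance to jump once only the distant points remain outside the subset.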

When non-outlying data stem from a mixture distribution, the Mahalanobis 
distance cannot be generally used as a measure of discrepancy. A proper criterion for 
ordering the units by closeness to the assumed model is required, together with a 
consistent method for finding the starting subset of observations. In this paper a novel 
algorithm of sequential point addition is proposed, designed for situations where 
non-outlying data come from a mixture of K > 1 normal components, with K 
assumed to be known. Two possible formulations are presented, each related to one 
of the two aforementioned approaches to FS-based outlier detection, hereafter called 
"graphical" and "inferential", respectively. 

3 Forward Search and Normal Mixture Models: the graphical 
approach 

We assume that the d-dimensional random vector X is distributed according to a 
K-component Normal mixture model: 

$$p(x) = \sum_{k=1}^{K} w_k\, \phi(x; \mu_k, \Sigma_k) \qquad (1)$$

where each Gaussian density $\phi(\cdot)$ is parameterized by its mean vector $\mu_k \in \mathbb{R}^d$ and 
covariance matrix $\Sigma_k$ belonging to the set of positive definite $d \times d$ matrices, and 
$w_k$ (k = 1, ..., K) are mixing proportions; we suppose that some contamination is 
present in the sample. Because of the zero breakdown-point of ML estimators, the 
FS graphical approach can still be useful for outlier detection in normal mixtures, 
provided that the three aspects that make up the search are properly modified: the 
choice of an initial subset, the way we progress in the search and the statistic to be 
monitored during the search. 

Subset $S_{m_0}$ could be defined as the union of K subsets, each located well inside 
a single mixture component: each set could be determined by using robust bivariate 
boxplots or robustly centered ellipses (both described in Atkinson et al., (2004)) on 
a distinct element of the data partition provided by some robust clustering method. 
This requires that model (1) is a clustering model. As a more general solution, we 
propose to define $S_{m_0}$ as a subset of high-density observations, since it is unlikely 
that outliers lie in high-density regions of $\mathbb{R}^d$. For this purpose, a nonparametric 
density estimate is built on the whole data set and the observations $x_i$ (i = 1, ..., n) 
are sorted in decreasing order of estimated density. Denoting by $x_{[i],0}$ the observation 
with the i-th ordered density (estimated at step 0), we take: 

$$S_{m_0} = \{x_{[i],0} : i = 1, \dots, m_0\}. \qquad (2)$$

It is worth noting that nonparametric density estimation is used here in order to 
dampen the effect of outliers. Its use limits the applicability of the proposed method 
to large medium-dimensional datasets; however, it is well known that nonparametric 
density estimation is less sensitive to the curse of dimensionality just in the region(s) 
around the mode(s). 

In order to define how to progress in the search, the following criterion is 
proposed, for m ranging from $m_0$ to n. Given the current subset $S_m$, model (1) is fitted 
by the EM algorithm and the parameter estimates 
$\{\hat{w}_{k,m}, \hat\mu_{k,m}, \hat\Sigma_{k,m};\ k = 1, \dots, K\}$ are obtained. For each observation $x_i$, the 
corresponding estimated value of the mixture density function 

$$\hat{p}(x_i) = \sum_{k=1}^{K} \hat{w}_{k,m}\, \phi(x_i; \hat\mu_{k,m}, \hat\Sigma_{k,m}) \qquad (3)$$

is taken as a measure of closeness of $x_i$ to the fitted model. The density values 
$\hat{p}(x_i)$ are then ordered from largest to smallest and the m + 1 observations with the 
highest values are taken to form the new subset $S_{m+1}$. This sorting criterion is 
coherent with (2); moreover, when K = 1 it is equivalent, but opposite, to that defined 
by the normalized squared Mahalanobis distance: 

$$D^*(x_i; \hat\mu_m, \hat\Sigma_m) = \tfrac{1}{2}\left[d\ln(2\pi) + \ln(|\hat\Sigma_m|) + (x_i - \hat\mu_m)'\hat\Sigma_m^{-1}(x_i - \hat\mu_m)\right]. \qquad (4)$$

In elliptical K-means clustering, (4) is preferred to the squared Mahalanobis distance 
because of stability reasons. 
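
The stated equivalence for K = 1 can be verified directly. The short derivation below is a sketch that uses only the Gaussian density with estimated mean and covariance; the hat notation is assumed here for the estimates obtained on the current subset.

```latex
\hat{p}(x_i) = \phi(x_i;\hat\mu_m,\hat\Sigma_m)
  = (2\pi)^{-d/2}\,|\hat\Sigma_m|^{-1/2}
    \exp\!\Big(-\tfrac12\,(x_i-\hat\mu_m)'\hat\Sigma_m^{-1}(x_i-\hat\mu_m)\Big)
\quad\Longrightarrow\quad
-\ln \hat{p}(x_i)
  = \tfrac12\Big[d\ln(2\pi) + \ln|\hat\Sigma_m|
      + (x_i-\hat\mu_m)'\hat\Sigma_m^{-1}(x_i-\hat\mu_m)\Big]
  = D^*(x_i;\hat\mu_m,\hat\Sigma_m).
```

Since the logarithm is increasing, sorting the observations by decreasing density is the same as sorting them by increasing normalized squared Mahalanobis distance.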

In our experiments we found that the inclusion of outlying points can be well 
monitored by plotting the values of the following statistic: 

$$s_m = -\ln\left(\max\{\hat{p}(x_i) : x_i \notin S_m\}\right). \qquad (5)$$

It is the negative natural logarithm of the maximum density estimate amongst 
observations not included in the current subset: if an outlier is about to enter, the 
value of $s_m$ will be large relative to the previous ones. When K = 1, monitoring (5) is 
equivalent to monitoring the minimum value of (4) amongst observations not included 
in $S_m$. 
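
A single step of this mixture-based search can be sketched as follows. To keep the example self-contained, the data are univariate and the mixture parameters are passed in directly, standing in for the estimates that the EM algorithm would produce on the current subset; both choices are simplifications made for this illustration, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of the univariate normal distribution N(mu, sigma2) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

def mixture_density(x, weights, means, variances):
    """Value of the fitted mixture density at x."""
    return sum(w * normal_pdf(x, mu, s2)
               for w, mu, s2 in zip(weights, means, variances))

def forward_step(data, subset, weights, means, variances):
    """One step of the search, given parameters fitted on `subset`.

    Returns (s_m, new_subset): s_m is minus the log of the largest density
    among units outside the subset; new_subset holds the indices of the
    m + 1 units with the highest density.
    """
    dens = [mixture_density(x, weights, means, variances) for x in data]
    inside = set(subset)
    s_m = -math.log(max(dens[i] for i in range(len(data)) if i not in inside))
    order = sorted(range(len(data)), key=lambda i: -dens[i])
    return s_m, order[:len(subset) + 1]

# Hypothetical data: two groups near 0 and 5, and one remote point at 20.
data = [0.0, 0.1, -0.2, 5.0, 5.1, 4.9, 20.0]
s_m, new_subset = forward_step(data, [0, 1, 3, 4],
                               weights=[0.5, 0.5], means=[0.0, 5.0],
                               variances=[1.0, 1.0])
```

In a full search this step would be repeated with the parameters re-estimated on each new subset, and the sequence of `s_m` values would form the forward plot.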

The proposed procedure is illustrated on an artificial bivariate dataset, reported 
by Cuesta-Albertos et al. (available at http://personales.unican.es/cuestaj/ 
RobustEstimationMixtures.pdf) as an example where the t-mixture model can fail. 
The main stages of the procedure are shown in Figure 1: $m_0$ was set equal to 200 
and density estimation has been carried out on the whole data set through a Gaussian 
kernel estimator with "rule of thumb" bandwidth. The forward plot of (5) is reported 
only for the last 100 steps of the search, so that its final part is more legible: it signals 
the introduction of the first outlying influential observation with a sharp peak, just 
after the inclusion of 600 units in $S_m$. Stopping the search before the peak provides a 
robust fitting of the mixture, since it is estimated on all observations but the outlying 
ones. Good results were obtained also in case of symmetrical contamination. 

It could be objected that a 4-component mixture would work as well in the 
example above. However, in our experience we observed also situations where the 
cluster of outliers can hardly be identified by fitting a (K+1)-component mixture, since 
it tends to be "picked up" by a flat component accounting for generic noise (see, for 
instance, Example 3.2 in Cuesta-Albertos et al.). 

Anyway, the graphical exploration technique presented above is prone to errors, 
because not every data set will give rise to an obvious separation between extreme 
points which are outliers and extreme points which are not outliers. For this reason, 
a formulation of the FS in normal mixtures according to the “inferential approach" 
(mentioned in Section 2) should be devised. In the following section, a FS proce- 
dure involving a test about the outlyingness of a point with respect to a mixture is 
presented. 



4 Forward Search and Normal Mixture Models: the inferential 
approach 

The problem of outlier detection from a mixture is considered in McLachlan and 
Basford (1988). Attention is focused on the assessment of whether an observation is 
atypical of a mixture of K normal populations, $P_1, \dots, P_K$, on the basis of a set of m 
observations $\{x_{hk};\ h = 1, \dots, m_k,\ k = 1, \dots, K\}$, where $x_{hk}$ are known to come from 
$P_k$ and $\sum_{k=1}^{K} m_k = m$. The problem is tackled by assessing how typical the 
observation is of each $P_k$ in turn. 

Fig. 1. The example from Cuesta-Albertos et al.: 20 outliers are added to a sample of 600 
observations. Top right panel shows the contour plot of the density estimate and the $m_0$ = 200 
(circled) observations belonging to the starting subset. Bottom left panel reports the 
monitoring plot of (5) for m = 520, ..., 620. The 95% ellipses of the mixture components fitted 
to $S_{600}$ are plotted in the last panel. 

In case of unclassified data $\{x_j;\ j = 1, \dots, m\}$ - like the one considered in the 
present paper - McLachlan and Basford suggest that the m observations should be 
first clustered by fitting a K-component heteroscedastic normal mixture model. Then, 
the aforementioned comparison of the tested observation to each of the mixture 
components in turn is applied to the resulting K clusters as if they represent a "true 
classification" of the data. The approach is based on the following distributional 
results, which are derived under the assumption that model (1) is valid: 

for the generic sample observation $x_j$, the quantity 

$$\frac{(\nu_k m_k / d)\, D(x_j; \hat\mu_k, \hat\Sigma_k)}{(\nu_k + d)(m_k - 1) - m_k D(x_j; \hat\mu_k, \hat\Sigma_k)} \qquad (6)$$

has the $F_{d,\nu_k}$ distribution, where 
$D(x_j; \hat\mu_k, \hat\Sigma_k) = (x_j - \hat\mu_k)'\hat\Sigma_k^{-1}(x_j - \hat\mu_k)$ denotes the squared Mahalanobis 
distance of $x_j$ from the k-th cluster, $m_k$ is the number of observations put in the 
k-th cluster by the estimated mixture model and $\nu_k = m_k - d - 1$, with 
k = 1, ..., K; 

for a new unclassified observation y, the quantity 

$$\frac{m_k(\nu_k + 1)}{(m_k + 1)\, d\, (\nu_k + d)}\, D(y; \hat\mu_k, \hat\Sigma_k) \qquad (7)$$

has the $F_{d,\nu_k+1}$ distribution, where $D(y; \hat\mu_k, \hat\Sigma_k)$ denotes the squared Mahalanobis 
distance of y from the k-th cluster, and $\nu_k$ and $m_k$ are defined as before, with 
k = 1, ..., K. 

Therefore, an assessment of how typical an observation z is of the k-th component 
of the mixture is given by the tail area to the right of the observed value of (6) or 
(7) under the F distribution with the appropriate degrees of freedom, depending on 
whether z belongs to the sample ($z = x_j$) or not (z = y). Finally, if $a_k(z)$ denotes this 
tail area, z is assessed as being atypical of the mixture if 

$$a(z) = \max_{k=1,\dots,K} a_k(z) < \alpha, \qquad (8)$$

where $\alpha$ is some specified threshold. According to rule (8), z will be labelled as 
outlying of the mixture if it is outlying of all the mixture components. The value of 
$\alpha$ depends on how the presence of apparently atypical observations is handled: the 
more protection is desired against the possible presence of outliers, the higher the 
value of $\alpha$. 
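
The quantities entering rule (8) are simple to compute once the squared Mahalanobis distances are available. The sketch below evaluates the statistics (6) and (7), written in the form implied by the standard F-distribution results for Mahalanobis distances with $\nu_k = m_k - d - 1$ (this form should be checked against McLachlan and Basford (1988)), and applies rule (8) to tail areas supplied by the caller. Computing the tail areas themselves requires the survival function of the F distribution, which is not in the Python standard library and is therefore left out; all function names are illustrative.

```python
def stat_in_sample(D, m_k, d):
    """Statistic (6) for an observation that belongs to the k-th cluster.

    D is its squared Mahalanobis distance from the cluster, m_k the cluster
    size and d the dimension; the reference distribution is F(d, nu_k).
    """
    nu = m_k - d - 1
    return (nu * m_k / d) * D / ((nu + d) * (m_k - 1) - m_k * D)

def stat_new_obs(D, m_k, d):
    """Statistic (7) for a new observation; reference distribution F(d, nu_k + 1)."""
    nu = m_k - d - 1
    return m_k * (nu + 1) * D / ((m_k + 1) * d * (nu + d))

def is_atypical(tail_areas, alpha):
    """Rule (8): atypical of the mixture if even the largest tail area is below alpha."""
    return max(tail_areas) < alpha
```

For instance, with $m_k = 10$, d = 2 and D = 3 the two statistics are 105/51 and 240/198, respectively; an observation with tail areas 0.001 and 0.3 for two components is not flagged at $\alpha$ = 0.01, because it is typical of the second component.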

We present a FS algorithm using the typicality index a(z) as a measure of 
"closeness" of a generic observation z to the fitted mixture model. For the sake of 
simplicity, the same criterion for selecting $S_{m_0}$ described in Section 3 is employed. 
Then, at each step of the search, a K-component normal mixture model is fitted to the 
current subset $S_m$ and the typicality index is computed for each observation $x_i$ 
(i = 1, ..., n) by means of (6) or (7), depending on whether the observation is an 
element of $S_m$ or an element of the remainder of the sample in step m. Then, 
observations are sorted in decreasing order of typicality: denoting by $x_{[i],m}$ the 
observation with the i-th ordered typicality value (computed on subset $S_m$), subset 
$S_m$ is updated as the set of the m + 1 most typical observations: 
$S_{m+1} = \{x_{[i],m} : i = 1, \dots, m+1\}$. 

If the least typical observation in the newly created subset, that is $x_{[m+1],m}$, is 
assessed as being atypical according to rule (8), then the search stops: the tested 
observation is nominated as an outlier, together with all the observations not included 
in the subset. The performance of the FS-procedure based on the "inferential" 
approach has been compared with that of an outlier detection method for clustering 
in the presence of outliers (Hardin and Rocke, 2004). The method starts from a 
robust clustering of the data and involves a testing procedure about the outlyingness 
of the data, which exploits a distributional result for squared Mahalanobis distances 
based on minimum covariance determinant estimates of location and shape 
parameters. The comparison has been carried out on a simulation experiment reported 
in Hardin and Rocke's paper, with N = 100 independent replicates. In d = 4 dimensions, 
two groups of 300 observations each are simulated from $N(0, I)$ and $N(2c\mathbf{1}, I)$, 
respectively, where c is the separation constant used by Hardin and Rocke and $\mathbf{1}$ is 
a vector of d ones. Sixty outliers stemming from $N(4c\mathbf{1}, I)$ are planted in each 
dataset, thus placing the cluster of outliers at the same distance at which the clean 
clusters are separated. By separating two clusters of standard normal data at a distance 
of 2c, we have clusters that do not overlap with high probability. The following 
measures of performance have been used: 

$$A = \frac{\sum_{j=1}^{N} Out_j}{N\, n_{out}}, \qquad B = \frac{\sum_{j=1}^{N} TrueOut_j}{N\, n_{out}}, \qquad (9)$$

where $n_{out}$ = 60 is the number of planted outliers and $Out_j$ ($TrueOut_j$) is the number 
of observations (planted outliers) declared as outliers in the j-th replicate. Perfect 
performance occurs when A = B = 1. 
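
The two measures in (9) are averages over replicates, scaled by the number of planted outliers. A minimal sketch follows; the per-replicate counts are invented for illustration and are not results from the paper.

```python
def performance(out, true_out, n_out):
    """Measures A and B of (9).

    out[j]      : number of observations declared outliers in replicate j
    true_out[j] : number of planted outliers declared outliers in replicate j
    n_out       : number of planted outliers per replicate
    """
    n_rep = len(out)
    a = sum(out) / (n_rep * n_out)
    b = sum(true_out) / (n_rep * n_out)
    return a, b

# Hypothetical counts for four replicates with 60 planted outliers each.
A, B = performance([60, 62, 60, 61], [60, 60, 60, 59], n_out=60)
```

A value of A above 1 indicates that clean observations were flagged, while B below 1 indicates that some planted outliers were missed.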



Table 1. Results of the simulation experiment. In both the compared procedures $\alpha$ = 0.01. 
The first row is taken from Hardin and Rocke's paper. 

Technique            Measures of performance 
                     (A-1)·100      (B-1)·100 
Hardin and Rocke     4.03           -0.17 
FS-based             0.01           -0.05 



In Table 1 the measures of performance are given in terms of distance from 1. 
Both methods identify all the planted outliers in nearly all replicates. However, 
Hardin and Rocke's technique seems to have some tendency to identify a 
non-planted observation as an outlier. The FS-based method performs generally better, 
probably because it exploits the normality assumption on the components of the 
parental mixture density, by means of the typicality measure $a(\cdot)$. It is expected to 
be preferable also in case of highly overlapping mixture components, since Hardin 
and Rocke's algorithm may fail for clusters with significant overlap - as the Authors 
themselves point out. 



5 Concluding remarks and open issues 

One critical aspect of the proposed procedure (and of any FS method, indeed) is the 
choice of the size mo of the initial subset: it should be relatively small so as to avoid 
the initial inclusion of outliers, but also large enough to make stable estimates of the 
mixture parameters. Moreover, McLachlan and Basford’s test for outlier detection 
is known to have poor control over the overall signihcance level; we dealt with the 
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problem by using Bonferroni bounds. The test for outlier detection from a mixture 
proposed by Wang et al. (1997) does not suffer from this drawback but requires boot- 
strap techniques, thus its use in the FS algorithm would increase the computational 
burden of the whole procedure. 

FS methods are naturally computer-intensive. In our FS algorithm, time 
savings could come from using the estimation results of step m as initial values for 
the EM in step m+1. A possible drawback of this solution is that the results of one 
step irreversibly influence the following ones. The problem of improving computational 
efficiency while preserving effectiveness deserves further attention. Finally, 
we assume that the number of mixture components, K, is both fixed and known. In 
our experience, the first assumption does not seem to be crucial: when the subset does not 
contain data from one component, say g, the first observation from g may be signalled 
by the forward plot, but it cannot appear as an outlier since its inclusion does 
not occur in the final steps of the search. On the contrary, generalizing the procedure 
to unknown K is a rather challenging task, which we are presently working on. 
Abstract. Multiple imputation is a frequently used method for dealing with partial 
nonresponse. In this paper the use of finite Gaussian mixture models for multiple imputation in a 
Bayesian setting is discussed. Simulation studies illustrate the performance 
of the proposed method. 



1 Introduction 

Imputation is a common approach to dealing with nonresponse in surveys. It consists 
of substituting missing items with plausible values. This approach has been widely 
used because it allows one to work with a complete data set, so that standard analyses can 
be applied. Despite this important advantage, the introduction of imputed values 
is not a neutral task. In fact, imputed values are not really observed, and this should 
be explicitly taken into account in statistical inference based on the completed data 
set. If standard methods are applied as if the imputed values were really observed, 
the precision of the results is generally overestimated, resulting, for 
instance, in confidence intervals that are too narrow. Multiple imputation (Rubin, 1987) is 
a methodology for dealing with this problem. It essentially consists of imputing the 
incomplete data set a certain number of times following specific rules. Each of the 
resulting completed data sets is analysed by standard methods, and the results are combined 
to yield estimates and to assess their precision, including the additional source 
of variability due to nonresponse. The multiplicity of completed data sets has the 
role of reflecting the variability due to the imputation mechanism. Although in multiple 
imputation data normality is frequently assumed, this assumption does not fit 
all situations (e.g., multimodal distributions). Moreover, the analyst who works on 
the completed data set is not necessarily aware of the model used for 
imputation. Thus, problems may arise when the models used by the analyst and by 
the imputer are different. Meng (1994) suggests using an imputation model that 
is reasonably accurate and general to overcome this difficulty. To this aim, an interesting 
work is that of Paddock (2002), who proposes a nonparametric multiple 
imputation technique based on Polya trees. This technique is appealing since it can 
handle continuous and ordinal data, and in some circumstances also categorical 
variables. However, Paddock’s paper shows that, even with nonnormal data, in 
some cases the technique based on normality still performs considerably better. Nonnormal data can 
be dealt with by using finite mixtures of Gaussian distributions (GMM), since they 
are flexible enough to approximate a wide class of density functions with a limited 
number of parameters. These models can be seen as generalizations of the general 
location model used by Little and Rubin (2002) to model partially observed data 
with mixed categorical and continuous variables. Unlike in the latter case, however, 
in the present approach categorical variables are latent variables (‘class labels’ that 
are never observed), and their role is merely to allow a better approximation of the 
true data distribution. The performance of GMM in a likelihood-based approach to 
single imputation is evaluated in Di Zio et al. (2007). In this paper we discuss the 
use of finite mixtures of Gaussian distributions for multiple imputation in a Bayesian 
framework. The paper is structured as follows. Section 2 describes multiple imputation 
through mixture models. In Section 3, the problem of label switching is discussed. 
Section 4 is devoted to the description and discussion of the experiments 
carried out in order to assess the performance of the proposed method. 



2 Multiple imputation 

Multiple imputation has been proposed for both frequentist and Bayesian analyses. 
Nevertheless, the theoretical justification is most easily understood from the 
Bayesian perspective. In this setting, the ultimate goal is to fill in the missing values 
$Y_{mis}$ with values $\tilde{y}_{mis}$ drawn from the predictive distribution that, once an 
appropriate prior distribution for $\Phi$ is set, can be written as 

$$P(y_{mis} \mid y_{obs}) = \int P(y_{mis} \mid y_{obs}, \Phi)\, P(\Phi \mid y_{obs})\, d\Phi \qquad (1)$$

where $Y_{mis}$ are the missing values and $Y_{obs}$ the observed ones. The imputation 
process is repeated m times, so m completed data sets are obtained. These m different 
data sets incorporate the uncertainty about the missing imputed values. Let 
us suppose that $Q(Y)$ is the quantity of interest, e.g., a population mean, and that 
an estimate $\hat{Q}^{(i)}$ is computed on the ith completed data set, for $i = 1,\dots,m$. 
The final estimate $\hat{Q}$ is defined by $\hat{Q} = \frac{1}{m}\sum_{i=1}^{m} \hat{Q}^{(i)}$. An estimate T of the 
variance of $\hat{Q}$ can be obtained by combining a within component term $\bar{U}$ and a 
between component term B. The former is the average of the m standard variance 
estimates $U^{(i)}$ for complete data computed on the ith completed data set, for 
$i = 1,\dots,m$: $\bar{U} = \frac{1}{m}\sum_{i=1}^{m} U^{(i)}$. The between variance is the variance of the m 
estimates, i.e. $B = \frac{1}{m-1}\sum_{i=1}^{m} (\hat{Q}^{(i)} - \hat{Q})^2$. Finally, the total variance of $\hat{Q}$ is estimated by 
$T = \bar{U} + (1 + m^{-1})B$, and a 95% confidence interval for Q is given by $\hat{Q} \pm t_{\nu,0.975}\, T^{1/2}$, 
where the degrees of freedom are $\nu = (m-1)\{1 + [(1 + m^{-1})B]^{-1}\bar{U}\}^2$ (see Rubin, 
1987). 
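The combining rules above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library; the function name `combine_rubin` and the numbers in the usage line are hypothetical and not taken from the paper. The confidence interval itself is not computed, because the t quantile is not available in the standard library.

```python
import math
from statistics import mean, variance

def combine_rubin(estimates, variances):
    """Combine m completed-data estimates and their variance estimates.

    Returns (Q_hat, U_bar, B, T, nu) following the formulas in the text.
    """
    m = len(estimates)
    q_hat = mean(estimates)        # final point estimate
    u_bar = mean(variances)        # within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    t = u_bar + (1 + 1 / m) * b    # total variance
    # Degrees of freedom; taken as infinite if the m estimates coincide.
    nu = math.inf if b == 0 else (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_hat, u_bar, b, t, nu

# Hypothetical estimates and variances from m = 5 completed data sets.
q_hat, u_bar, b, t, nu = combine_rubin([10.0, 12.0, 11.0, 9.0, 13.0],
                                       [1.0, 1.2, 0.8, 1.1, 0.9])
```

With a t quantile from a statistics package, the interval would then be `q_hat ± t_quantile * math.sqrt(t)`.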
Since it is often difficult to obtain a closed form for the observed posterior distribution 
$P(\Phi \mid y_{obs})$, the data augmentation algorithm may be used (Tanner and Wong, 
1987). This algorithm consists of iterating the two following steps: 

1. I-step - draw $\tilde{y}_{mis}$ from $P(Y_{mis} \mid y_{obs}, \Phi)$ 

2. P-step - draw $\Phi$ from $P(\Phi \mid \tilde{y}_{mis}, y_{obs})$. 

This is a Gibbs sampling algorithm and, after convergence, the resulting sequence of 
values $\tilde{y}_{mis}$ can be thought of as generated from $P(Y_{mis} \mid y_{obs})$. Data augmentation is 
explicitly described by Schafer (1997) when data follow a Gaussian distribution. We 
study the case when data are generated from a finite mixture of K Gaussian distributions, 
i.e., when each observation $y_i$, for $i = 1,\dots,n$, is supposed to be a realization of 
a p-dimensional r.v. $Y_i$ with density: 

$$f(y_i \mid \Phi) = \sum_{k=1}^{K} \pi_k N_p(y_i \mid \theta_k),$$

where $\sum_{k=1}^{K} \pi_k = 1$, $\pi_k > 0$ for $k = 1,\dots,K$, and $N_p(y_i \mid \theta_k)$ is the Gaussian density 
with parameters $\theta_k = (\mu_k, \Sigma_k)$. Note that $\Phi$ denotes the full set of parameters: 
$\Phi = (\pi_1,\dots,\pi_K; \theta_1,\dots,\theta_K)$. 

Mixture models have a natural missing data formulation if we suppose that each 
observation $y_i$ comes from a specific but unknown component k of the mixture, and 
introduce, for each unit i, an indicator or allocation variable $Z_i$, taking values in 
$\{1,\dots,K\}$, with $Z_i = k$ if individual i belongs to group k. The discrete variables $Z_i$ are 
independently distributed according to $P(Z_i = k \mid \Phi) = \pi_k$ ($i = 1,\dots,n$; $k = 1,\dots,K$). 
Furthermore, conditional on $Z_i = k$, the observations $y_i$ are supposed to be i.i.d. from 
the density $N_p(y_i \mid \theta_k)$. Thus, if some items are missing for the ith unit, the relevant 
distribution, conditional on $Z_i = k$, is $P(Y_{mis} \mid y_{obs}, \theta_k)$, while the classification 
probabilities, expressed in terms of $y_{i,obs}$, are: 

$$\tau_{gi} = P(Z_i = g \mid y_{i,obs}, \Phi) = \frac{\pi_g N_p(y_{i,obs} \mid \theta_g)}{\sum_{k=1}^{K} \pi_k N_p(y_{i,obs} \mid \theta_k)}, \quad g = 1,\dots,K, \qquad (2)$$

where $N_p(y_{i,obs} \mid \theta_g)$ is the Gaussian marginal distribution of the gth mixture 
component for the variables observed in the ith unit. 

The previous formulation leads to a data augmentation algorithm consisting, at 
the tth iteration, of the following two steps: 

• I-step: for $i = 1,\dots,n$ 

  - draw a random value $z_i^{(t)}$ of the allocation variable from the distribution 
    $P(Z_i \mid y_{i,obs}, \Phi^{(t)})$, i.e., select a value in $\{1,\dots,K\}$ using the probabilities 
    $\tau_{1i},\dots,\tau_{Ki}$ defined in formula (2), expressed in terms of the current value 
    $\Phi^{(t)}$ of the parameter vector; 

  - draw $y_{i,mis}^{(t)}$ (the missing part of the ith vector $y_i$) from $P(Y_{i,mis} \mid z_i^{(t)}, y_{i,obs}, \Phi^{(t)})$. 

• P-step: 

  draw $\Phi^{(t+1)}$ from the distribution $P(\Phi \mid y_{obs}, y_{mis}^{(t)}, z^{(t)})$. 
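To make the I-step concrete, the following sketch treats the simplest non-trivial case: a bivariate unit whose first coordinate is observed and whose second is missing. It is a minimal illustration, not the authors' implementation; the function names and the tuple layout `(mu1, mu2, s11, s12, s22)` are assumptions, and the parameter values in the usage lines are borrowed from the first two coordinates of the mixture in Section 4 purely as an example. The allocation probabilities use the Gaussian marginal of the observed coordinate, as in formula (2), and the missing coordinate is drawn from the conditional Gaussian of the selected component.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def allocation_probs(y1, weights, comps):
    """Formula (2) when only the first coordinate is observed.

    comps[k] = (mu1, mu2, s11, s12, s22); the marginal of Y1 is N(mu1, s11).
    """
    joint = [w * normal_pdf(y1, c[0], c[2]) for w, c in zip(weights, comps)]
    total = sum(joint)
    return [v / total for v in joint]

def i_step_unit(y1, weights, comps, rng):
    """Draw the allocation z and the missing second coordinate for one unit."""
    tau = allocation_probs(y1, weights, comps)
    z = rng.choices(range(len(comps)), weights=tau)[0]
    mu1, mu2, s11, s12, s22 = comps[z]
    cond_mean = mu2 + s12 / s11 * (y1 - mu1)   # E[Y2 | Y1 = y1, Z = z]
    cond_var = s22 - s12 ** 2 / s11            # Var[Y2 | Y1 = y1, Z = z]
    return z, rng.gauss(cond_mean, math.sqrt(cond_var))

rng = random.Random(0)
weights = [0.75, 0.25]
comps = [(0.0, 0.0, 3.0, 2.4, 3.0), (3.0, 5.0, 4.0, 2.4, 3.5)]
z, y2 = i_step_unit(6.0, weights, comps, rng)
```

A full sampler would repeat this for every incomplete unit and then perform the P-step given the completed data and the allocations.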
The above scheme produces a sequence $(z^{(t)}, y_{mis}^{(t)}, \Phi^{(t)})$ which is a Markov chain 
with stationary distribution $P(Z, Y_{mis}, \Phi \mid y_{obs})$. The convergence properties of the 
algorithm have been studied by Diebolt and Robert (1994) in the case of completely 
observed data. 

The choice of an appropriate prior is a critical issue in Gaussian mixture models. 
For instance, reference priors lead to improper priors for the component-specific 
parameters that are independent across the mixture components. This situation is 
problematic insofar as posterior distributions remain improper for configurations where 
no units are assigned to some components. In this paper we follow a hierarchical 
Bayesian approach, based on weakly informative priors, as introduced by Richardson 
and Green (1997) for univariate mixtures, and generalized to the multivariate case 
by Stephens (2000). In this approach it is assumed that the prior distribution for $\mu_k$ is 
rather flat over an interval of variation of the data. The hierarchical structure of the 
prior distributions for a K-component p-variate Gaussian mixture is given by: 

$$\Sigma_k^{-1} \mid \beta \sim W(2\alpha, (2\beta)^{-1})$$
$$\beta \sim W(2\delta, (2h)^{-1})$$
$$\pi \sim D(\gamma),$$

where W and D denote the Wishart and Dirichlet distributions respectively, and the 
hyperparameters $\xi, \Psi, \alpha, \delta, h, \gamma$ are constants defined below. Let $R_j$ be the length 
of the observed interval of variation (range) of the obtained values for the variable 
$Y_j$, and $\xi_j$ the corresponding midpoint ($j = 1,\dots,p$). Then, $\xi$ is the p-vector 
$(\xi_1,\dots,\xi_p)$, while the matrix $\Psi$ is the diagonal matrix whose element $\psi_{jj}$ is $R_j^{-2}$. 
The other hyperparameters are specified as follows: 

$\alpha = p + 1$, $\delta = \alpha/10$, $h = 10\Psi$. 

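The P-step described in general terms above in this section can be implemented by sampling from 
the appropriate posterior distributions as follows: 

$$\beta^{(t+1)} \mid \cdots \sim W\Big(2\delta + 2K\alpha,\ \big(2h + 2\sum_{k=1}^{K} \Sigma_k^{-1}\big)^{-1}\Big),$$
$$\pi^{(t+1)} \mid \cdots \sim D(\gamma + n_1, \dots, \gamma + n_K),$$
$$\Sigma_k^{-1} \mid \cdots \sim W\Big(2\alpha + n_k,\ \big(2\beta^{(t+1)} + \sum_{i:\, z_i = k} (y_i - \mu_k)(y_i - \mu_k)'\big)^{-1}\Big),$$

where $\mid \cdots$ denotes conditioning on all other variables. In the previous formulas $n_k$ 
denotes the number of units assigned to the kth mixture component at the tth step, 
and $\bar{y}_k$ is the mean: $\sum_{i:\, z_i = k} y_i / n_k$. 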
3 Label switching 

Label switching is a typical problem in Bayesian estimation of finite mixture models 
(Stephens, 2000). When using symmetric priors (i.e., invariant with respect 
to permutations of the components), the posterior distributions are also symmetric, 
and thus the marginal posterior distributions of the parameters are identical 
for all the mixture components. Inference on component-specific parameters based on 
MCMC output is then meaningless, because it results in averaging over different mixture 
components. Nevertheless, this problem does not affect inference on quantities that do 
not depend on the component labels. For instance, if the parameter to be estimated is the 
population mean, as often required in official statistics, the target quantity is independent 
of the component labels. Moreover, in multiple imputation, the estimate is computed on the 
observed and imputed values, and the imputed values are drawn from $P(Y_{mis} \mid y_{obs})$, 
which is invariant with respect to permutations of the component labels. As an illustrative 
example, we have drawn 200 random samples from the two-component mixture 
$f(y) = 0.5N(1.3, 0.1) + 0.5N(2, 0.15)$ in $\mathbb{R}^1$, and nonresponse is artificially introduced 
with a 20% missing rate. This dataset is multiply imputed according to the 
algorithm previously described. Figure 1 shows the trace plots (5000 iterations) of the 
component means obtained via data augmentation and of the sample mean that is used to 
produce the multiple imputation estimates. In the figure, the component 
means of the generating mixture distribution (dashed lines) are also reported. Moreover, 
vertical lines, corresponding to label switching, are depicted. It is worth noting 
that the label switching of the component means does not affect the target estimate, 
which in fact is stable. 
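The invariance argument can be checked numerically with a small sketch. It computes the mean of the mixture used in the example, $\sum_k \pi_k \mu_k$, for every relabelling of the components; the helper name `mixture_mean` is hypothetical.

```python
import math
from itertools import permutations

def mixture_mean(weights, means):
    """Mean of a finite mixture: sum over components of weight times mean."""
    return sum(w * m for w, m in zip(weights, means))

weights = (0.5, 0.5)   # mixing proportions of the illustrative example
means = (1.3, 2.0)     # component means of the illustrative example

# Evaluate the mixture mean under every permutation of the component labels.
values = [
    mixture_mean([weights[k] for k in perm], [means[k] for k in perm])
    for perm in permutations(range(len(weights)))
]
invariant = all(math.isclose(v, values[0]) for v in values)
```

The component means themselves swap places under relabelling, but any quantity that depends on the mixture only as a whole, such as this mean, is unchanged.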
Fig. 1. Trace plot of the two-component means and the sample means computed through the 
data augmentation algorithm. 
4 Simulation study and results 

We present a simulation study to assess the performance of Bayesian GMM for multiple 
imputation. In order to mimic the situation in official statistics, a sample of 
N = 50000 units (representing the finite population) with three variables $(Y_1, Y_2, Y_3)$ 
is drawn from a probability model. The target parameter is the mean of the variables 
in the finite population. A random sample u of n = 1000 units is drawn without 
replacement from the reference population. This sample is corrupted by the introduction of 
missing values according to a Missing at Random (MAR) mechanism. Missing items 
are introduced for the variables $(Y_2, Y_3)$ depending on the observed values $y_1$ of the 
variable $Y_1$, under the assumption that the higher the value of $Y_1$, the higher the 
nonresponse propensity. Denoting by $q_i$ the ith quartile of the empirical distribution 
of $Y_1$, the nonresponse probabilities for $(Y_2, Y_3)$ are 0.1 if $y_1 < q_1$, 0.2 if $y_1 \in [q_1, q_2)$, 
0.4 if $y_1 \in [q_2, q_3)$ and 0.5 if $y_1 \ge q_3$. 
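The nonresponse mechanism described above can be sketched as follows. This is an illustration under two assumptions that the text does not settle: $Y_2$ and $Y_3$ are taken to be missing jointly for a nonresponding unit, and the quartiles are computed with the default method of `statistics.quantiles`. The function names are hypothetical.

```python
import random
from statistics import quantiles

def nonresponse_prob(y1, q1, q2, q3):
    """Nonresponse probability for (Y2, Y3) as a step function of y1."""
    if y1 < q1:
        return 0.1
    if y1 < q2:
        return 0.2
    if y1 < q3:
        return 0.4
    return 0.5

def apply_mar(sample, rng):
    """Blank out (y2, y3) at random, with probability depending only on y1.

    sample is a list of (y1, y2, y3) tuples; missing values are coded as None.
    """
    q1, q2, q3 = quantiles([row[0] for row in sample], n=4)
    corrupted = []
    for y1, y2, y3 in sample:
        if rng.random() < nonresponse_prob(y1, q1, q2, q3):
            corrupted.append((y1, None, None))   # Y1 is always observed
        else:
            corrupted.append((y1, y2, y3))
    return corrupted
```

Because the probability of a missing item depends only on the fully observed $Y_1$, the mechanism is MAR by construction.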

The sample u is multiply imputed (m = 5) via GMM. The data augmentation algorithm 
is initialized by using maximum likelihood estimates (MLE) obtained through the 
EM algorithm as described in Di Zio et al. (2007). After a burn-in period of 500 iterations, 
multiple imputation is performed by subsampling the chain every t iterations, 
that is, the $Y_{mis}$ used for imputation are those referring to the iterations (t, 2t, ..., 5t). 
Subsampling is used to avoid dependent samples, as suggested by Schafer (1997). 
Although the burn-in period may appear rather short, the initialization of the 
algorithm with a good starting point (e.g., through MLE) may speed up the 
convergence of the chain, as again suggested by Schafer (1997). This is also confirmed by 
analysing the trace plots of the parameters. 

Once the data set is imputed, for each analysed variable, the estimate of the mean, 
its variance, and the corresponding 95% confidence interval for the mean are computed 
by applying the multiple imputation formulas to the usual Horvitz-Thompson 
estimator $\hat{\bar{Y}} = \bar{y}$ and to its estimated variance $\widehat{Var}(\hat{\bar{Y}}) = (\frac{1}{n} - \frac{1}{N}) s^2$, where $s^2$ is the 
sample variance. The estimates are compared to the true mean value of the population 
by computing the squared difference, and verifying whether the true value is 
included in the confidence interval. Taking the population as fixed, the experiment is 
repeated 1000 times, and the results are averaged over these replications. The results 
give simulated MSE, bias, simulated coverage corresponding to a 95% nominal level, 
and average length of the confidence intervals. 
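As a small worked example of the complete-data quantities that enter the multiple imputation formulas, the sketch below computes the sample mean and its estimated variance under simple random sampling without replacement, $(1/n - 1/N)s^2$. The function name and the toy data are hypothetical.

```python
from statistics import mean, variance

def srs_mean_and_variance(y, population_size):
    """Sample mean and its estimated variance for a simple random sample
    of size n drawn without replacement from a population of size N."""
    n = len(y)
    s2 = variance(y)   # sample variance with divisor n - 1
    return mean(y), (1 / n - 1 / population_size) * s2

# Toy completed data set: n = 5 units from a population of N = 50.
y_bar, var_y_bar = srs_mean_and_variance([1.0, 2.0, 3.0, 4.0, 5.0], 50)
```

In the simulation these two numbers would be computed on each of the m completed data sets and then combined across imputations.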

This simulation scheme is applied in two settings. In the first, the population is 
drawn from a two-component Gaussian mixture, with mixing parameter $\pi = 0.75$, 
mean vectors $\mu_1 = (0, 0, 0)'$, $\mu_2 = (3, 5, 8)'$, and covariance matrices 

$$\Sigma_1 = \begin{pmatrix} 3.0 & 2.4 & 2.4 \\ 2.4 & 3.0 & 2.1 \\ 2.4 & 2.1 & 1.3 \end{pmatrix}, \qquad \Sigma_2 = \begin{pmatrix} 4.0 & 2.4 & 2.4 \\ 2.4 & 3.5 & 2.1 \\ 2.4 & 2.1 & 3.2 \end{pmatrix}.$$

In the second setting, the population is generated from Cheriyan and Ramabhadran’s 
multivariate Gamma distribution described in Kotz et al. (2000), pp. 454-456. 
In order to draw a sample of a 3-variate random vector $(Y_1, Y_2, Y_3)$ from such 
a distribution the following procedure is adopted. First, we consider 4 independent 
random variables $X_i$ in $\mathbb{R}^1$, for $i = 0, 1, 2, 3$, that are distributed according to Gamma 
distributions characterised by different parameters $\theta_i$. Then, the 3-variate random 
vector is obtained by combining the $X_i$ so that $Y_i = X_0 + X_i$ for $i = 1, 2, 3$. The values of 
the parameters are $\theta = (1, 0.2, 0.2, 0.4)'$. 
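The construction $Y_i = X_0 + X_i$ can be sketched directly. The text does not say how the Gamma distributions are parameterised, so this sketch assumes that each $\theta_i$ is a shape parameter and that the scale is 1; the function name is hypothetical.

```python
import random

def draw_multivariate_gamma(theta, rng):
    """One draw of (Y1, Y2, Y3) with Y_i = X_0 + X_i.

    Assumption: X_i ~ Gamma(shape=theta[i], scale=1), independent.
    """
    x = [rng.gammavariate(shape, 1.0) for shape in theta]
    return tuple(x[0] + x_i for x_i in x[1:])

rng = random.Random(42)
theta = (1.0, 0.2, 0.2, 0.4)
population = [draw_multivariate_gamma(theta, rng) for _ in range(20000)]
```

The shared term $X_0$ is what induces positive dependence among the three coordinates; under the stated assumption the mean of $Y_1$ is $\theta_0 + \theta_1 = 1.2$.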

In the two-component Gaussian mixture population, multiple imputation is carried 
out according to a plain normal model (hereafter NM) and a mixture of two 
Gaussian components ($M_2$). The results for the variable $Y_3$ are illustrated in Table 1. 
For the Gamma population, multiple imputation is performed by using the plain 
normal model (NM) and a K-component mixture $M_K$ for K = 2, 3, 4. Results for the 
variable $Y_3$ are provided in Table 2. 



Table 1. Results of the experiment where the population is based on a two-component Gaussian 
mixture 

Mod      bias       MSE      S.Cov    Length
NM     -0.0144    0.1323    93.7%    0.5000
M_2     0.0014    0.1316    94.9%    0.5163



Table 2. Results of the experiment where the population is based on the multivariate Gamma 
distribution 

Mod      bias       MSE      S.Cov    Length
NM      0.0015    0.0431    93.8%    0.1604
M_2     0.0052    0.0437    94.0%    0.1661
M_3     0.0043    0.0435    94.0%    0.1651
M_4     0.0059    0.0442    94.1%    0.1655



Results show that the simulated coverage of the confidence intervals is close to the 
nominal level. In particular, in the first experiment, the coverage obtained with the 
mixture model (94.9%) is closer to the nominal level than that obtained with the plain 
Gaussian model (93.7%). The improvement 
is due to the fact that the model used for estimation is correctly specified. This 
suggests the need to improve the estimation of unknown distributions by means of mixture 
models. To this aim, an important step could be to consider the number of mixture 
components as a random variable, thus incorporating the model uncertainty in the 
estimation phase. 
Abstract. The analysis of genetic diseases has classically been directed towards establishing 
direct links between cause, a genetic variation, and effect, the observable deviation of 
phenotype. For complex diseases which are caused by multiple factors and which show a wide 
spread of variations in the phenotypes this is unlikely to succeed. One example is 
Attention Deficit Hyperactivity Disorder (ADHD), where it is expected that phenotypic variations 
will be caused by the overlapping effects of several distinct genetic mechanisms. The classical 
statistical models to cope with overlapping subgroups are mixture models, essentially convex 
combinations of density functions, which allow inference of descriptive models from data as 
well as the deduction of groups. An extension of conventional mixtures with attractive 
properties for clustering is the context-specific independence (CSI) framework. CSI allows for an 
automatic adaptation of model complexity to avoid overfitting and yields a highly descriptive 
model. 



1 Introduction 

Attention deficit hyperactivity disorder (ADHD) is diagnosed in 3%-5% of all 
children in the US and is considered to be the most common neurobehavioral 
disorder in children. Today ADHD is known to be influenced by a multitude of factors 
such as genetic disposition, neurological properties and environmental conditions 
(Swanson et al. (2000a), Woodruff et al. (2004)). The phenotypes usually associated 
with ADHD fall into the general categories of inattentiveness, hyperactivity and 
impulsivity. This is only a partial list of symptoms associated with ADHD and it is 
noteworthy that most patients will only show some of these behaviors, to differing 
degrees. This wide spread of observable symptoms associated with ADHD supports 
the notion that possible ADHD subtypes will have complex characteristics and may 
contain overlaps. Since ADHD has a complex non-Mendelian mode of inheritance, 
a partition of phenotypes into clearly separated groups cannot be expected. Rather, 
some phenotypic variations will be caused by several distinct genetic mechanisms. 
The neurotransmitter dopamine and the genes involved in dopamine function are 
known to be relevant to ADHD (Gill et al. (1997)). According to the prevalent theory 
(Cook et al. (1995)), the contribution of the dopamine metabolism to ADHD is 
based on over-activity of dopamine transporters in the pre-synaptic membrane, which 
leads to reduced dopamine concentrations in the synaptic gap. There have been studies 
linking the disposition towards ADHD with the genotypes of a variable number 
of tandem repeats (VNTR) region on the third exon of the dopamine receptor gene 
DRD4 (Swanson et al. (2000b)). Considering all this, it seems promising to explore 
the influences of different dopamine receptor haplotypes on ADHD related phenotypes 
and the subgroup decompositions implicit in these relationships. For complex 
genetic diseases such as ADHD, for which the degree of diagnostic uncertainty with 
respect to presence of the disease and determination of the disease subtype is large, 
the search for simple, direct causalities between different factors is likely to fail (Luft 
(2000)). Rather, one would expect to find correlations in the form of changes in disposition 
for a specific disease feature. When attempting to cluster data from such a complex 
disease, it is important that the clustering method can accommodate this kind of 
uncertainty. The classical statistical approach in this situation is mixture modelling. 
An extension of the conventional mixture framework is the class of context-specific 
independence (CSI) mixture models (Barash and Friedman (2002), Georgi and Schliep 
(2006)). In a CSI model the number of parameters used, i.e. the model complexity, 
is automatically adapted to match the level of variability present in the data. 

In this paper we present a CSI mixture model-based clustering of a data set of 
ADHD patients that consists of both genotypic and phenotypic features. The data 
set includes 134 samples with 91 genotypic variables and 27 phenotypic variables 
each. The genotype variables contain variable number of tandem repeats (VNTR) 
information on the DRD4 gene as well as Single Nucleotide Polymorphism (SNP) 
data on four dopamine receptor genes (DRD1-DRD3, DRD5) and one dopamine transporter 
gene (DAT1). The DRD family proteins are G-protein coupled dopamine receptors 
located in the plasma membrane. DAT1 encodes a dopamine transporter located 
in the presynaptic membrane. The phenotypes are represented by two IQ and three 
achievement test scores, as well as 21 diagnoses for various comorbid behavioral 
disorders. 



2 Methods 

Let $X_1,\dots,X_p$ be discrete random variables. Given a data set D of N realizations, 
$D = x_1,\dots,x_N$ with $x_i = (x_{i1},\dots,x_{ip})$, a conventional mixture density (see McLachlan 
and Peel (2000) for details) is given by: 

$$P(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k), \qquad (1)$$

where the $\pi_k$ are the non-negative mixture coefficients, $\sum_{k=1}^{K} \pi_k = 1$, and each 
component distribution 

$$f_k(x_i; \theta_k) = \prod_{j=1}^{p} f_{kj}(x_{ij}; \theta_{kj}) \qquad (2)$$

is a product distribution over $X_1,\dots,X_p$ parameterized by parameters 
$\theta_k = (\theta_{k1},\dots,\theta_{kp})$. In other words, we assume conditional independence between 
features within the mixture components and adopt the Naive Bayes model as component 
distributions. All component distribution parameters are denoted by 
$\theta_M = (\theta_1,\dots,\theta_K)$. Finally, the complete parameterization of the mixture M is then given 
by $M = (\pi, \theta_M)$. The likelihood of data set D under the mixture M is given by 



$$P(D \mid M) = \prod_{i=1}^{N} P(x_i), \qquad (3)$$

that is, we have the usual assumption of independence between samples. 

The standard technique for learning the parameters $\theta_M$ is the Expectation Maximization 
(EM) algorithm (Dempster et al. (1977)). The central quantity for the EM 
based parameter estimation is the posterior of component membership given by 

$$\tau_{ik} = \frac{\pi_k f_k(x_i; \theta_k)}{\sum_{k'=1}^{K} \pi_{k'} f_{k'}(x_i; \theta_{k'})}, \qquad (4)$$

i.e. $\tau_{ik}$ is the probability that a sample $x_i$ was generated by component k. Moreover, 
this posterior is used for assigning samples to clusters (i.e. components). This is done 
by assigning a sample to the component with maximal posterior. 
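The posterior (4) and the resulting cluster assignment can be illustrated with a small sketch for discrete features. The data layout is an assumption made for the example: `theta[k][j]` is a dictionary mapping each value of feature j to its probability under component k. The function names and the numbers in the usage lines are hypothetical, and components are indexed from 0.

```python
def component_density(x, theta_k):
    """Product distribution (2) for one component with discrete features."""
    p = 1.0
    for value, table in zip(x, theta_k):
        p *= table[value]
    return p

def posterior(x, pi, theta):
    """Posterior of component membership, equation (4)."""
    joint = [w * component_density(x, theta_k) for w, theta_k in zip(pi, theta)]
    total = sum(joint)
    return [v / total for v in joint]

def assign(x, pi, theta):
    """Index of the component with maximal posterior."""
    tau = posterior(x, pi, theta)
    return max(range(len(tau)), key=tau.__getitem__)

# Two components, two binary features (illustrative numbers only).
pi = [0.6, 0.4]
theta = [
    [{0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}],
    [{0: 0.2, 1: 0.8}, {0: 0.5, 1: 0.5}],
]
tau = posterior((1, 0), pi, theta)
cluster = assign((1, 0), pi, theta)
```

In an EM iteration the same posteriors would serve as weights when re-estimating the mixture coefficients and the component parameters.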




Fig. 1. Model structure matrices for a) conventional mixture model and b) CSI mixture model. 
Columns correspond to the features $X_1,\dots,X_4$. 



The conventional mixture model defined above requires the estimation of one 
set of distribution parameters $\theta_{kj}$ per feature and component. This is visualized in 
the matrix in Fig. 1 a). This example shows a model with five components and four 
features. Each cell in the matrix represents one $\theta_{kj}$. The central idea of the 
context-specific independence extension of the mixture framework is that for many data sets 
it will not be necessary to estimate separate parameters in each feature for all 
components. Rather, one should learn only as many parameters as is justified by the 
variability found in the data. This leads to the kind of matrix shown in Fig. 1 b). Here 
each cell spanning multiple rows represents a single set of parameters for multiple 
components. For instance, for feature $X_1$, $C_1$ and $C_2$ share the same parameters, for 
feature $X_2$, $C_2$-$C_4$ have the same parameters, and for $X_4$ all components share a 
single set of parameters. This modification of the conventional mixture framework 
has a number of attractive properties. The model complexity is reduced, as there are 
fewer free parameters to estimate. Also, if a feature has only a single set of parameters 
assigned for all components (such as $X_4$ in Fig. 1 b)), its contribution in (4) will 
cancel out and it will not affect the clustering. This amounts to a feature selection 
in which the impact of noisy features is negated as an integral part of model training. 
Hence, we can expect a more robust clustering in which the risk of overfitting is 
greatly reduced. Finally, the model structure matrix yields a highly descriptive model 
which facilitates the analysis of a clustering. For instance, the matrix in Fig. 1 b) shows that 
clusters $C_4$ and $C_5$ are only distinguished by feature $X_2$. 

Formally the CSI mixture model is defined as follows: Given the set of component 
indexes $C = \{1,\dots,K\}$ and features $X_1,\dots,X_p$, let $G = \{g_j\}_{(j=1,\dots,p)}$ be the 
CSI structure of the model M. Then $g_j = (g_{j1},\dots,g_{jZ_j})$ such that $Z_j$ is the number of 
subgroups for $X_j$ and each $g_{jr}$, $r = 1,\dots,Z_j$, is a subset of component indexes from C. 
That means each $g_j$ is a partition of C into disjoint subsets where each $g_{jr}$ represents 
a subgroup of components with the same distribution for $X_j$. The CSI mixture distribution 
is then obtained by replacing $f_{kj}(x_{ij}; \theta_{kj})$ with $f_{kj}(x_{ij}; \theta_{g_j(k)j})$ in (2), where 
$g_j(k) = r$ such that $k \in g_{jr}$. Accordingly $\theta_M = (\pi, \theta_{X_1|g_{11}},\dots,\theta_{X_p|g_{pZ_p}})$ is the model 
parametrization, where $\theta_{X_j|g_{jr}}$ denotes the different parameter sets in the structure 
for feature j. The complete CSI model M is then given by $M = (G, \theta_M)$. Note that 
we have covered the CSI mixture model and the structure learning algorithm in some 
more detail in a previous publication (Georgi and Schliep (2006)). 
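A short sketch may help to see how a CSI structure shares parameters and why a feature with a single subgroup drops out of the posterior (4). The representation is an assumption chosen for the example: `G[j]` is a list of sets of component indexes (0-based) forming the partition $g_j$, and `params[j][r]` is the value-to-probability table shared by subgroup r of feature j. All names and numbers are hypothetical.

```python
def subgroup_index(g_j, k):
    """Return r such that component k belongs to subgroup g_j[r]."""
    for r, group in enumerate(g_j):
        if k in group:
            return r
    raise ValueError("component not covered by the partition")

def csi_density(x, k, G, params):
    """Component density with parameters shared according to the structure G."""
    p = 1.0
    for j, value in enumerate(x):
        p *= params[j][subgroup_index(G[j], k)][value]
    return p

def csi_posterior(x, pi, G, params):
    """Posterior of component membership under the CSI model."""
    joint = [w * csi_density(x, k, G, params) for k, w in enumerate(pi)]
    total = sum(joint)
    return [v / total for v in joint]

def parameter_sets(G):
    """Number of parameter sets to estimate: the sum of Z_j over features."""
    return sum(len(g_j) for g_j in G)

# Three components, two binary features. Feature 0 separates {0, 1} from {2};
# feature 1 has a single subgroup and is therefore uninformative.
G = [[{0, 1}, {2}], [{0, 1, 2}]]
params = [[{0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}], [{0: 0.3, 1: 0.7}]]
pi = [1 / 3, 1 / 3, 1 / 3]
post_a = csi_posterior((0, 0), pi, G, params)
post_b = csi_posterior((0, 1), pi, G, params)
```

Here `post_a` and `post_b` coincide although the two samples differ in the second feature, and the structure needs 3 parameter sets instead of the 6 required by a conventional mixture with the same dimensions.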

2.1 Structure Learning 

To learn the CSI structure from data we took a Bayesian approach. That means different 
models are scored by their posterior distribution, which can be efficiently computed 
in the Structural EM framework (Friedman (1998)). The model posterior is 
given by $P(M \mid D) \propto P(D \mid M)P(M)$, where $P(D \mid M)$ is the Bayesian likelihood with 
$P(D \mid M) = P(D \mid \hat{\theta}_M)P(\hat{\theta}_M)$. $P(D \mid \hat{\theta}_M)$ is the mixture likelihood (3) of the data 
evaluated at the maximum a posteriori parameters $\hat{\theta}_M$. $P(\hat{\theta}_M)$ is a conjugate prior 
over the model parameters. Due to the independence assumptions $P(\hat{\theta}_M)$ decomposes 
into a product distribution of conjugate priors over $\pi$ and the individual $\theta_{X_j|g_{jr}}$. 
For discrete distributions the Dirichlet distribution and for Gaussians a Normal 
Inverse-Gamma prior was used. The second term needed to evaluate the model posterior 
is the prior over the model structure $P(M)$, which is given by $P(M) \propto P(K)P(G)$ 
with $P(K) \propto \gamma^{K}$ and $P(G) \propto \prod_{j=1}^{p} \alpha^{Z_j}$. Here $\gamma < 1$ and $\alpha < 1$ are hyperparameters which 
act as a regularization of the structure learning by introducing a bias towards less 
complex models into the posterior. Here, $\alpha$ and $\gamma$ were chosen as weak priors by the 
heuristic introduced in Georgi and Schliep (2006) with $\delta = 0.05$. Since exhaustive 
evaluation of all possible structures is infeasible, the structure learning is carried out 
by a straightforward greedy procedure starting from the full structure matrix (again 
refer to Georgi and Schliep (2006) for details). 
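The regularising effect of the structure prior can be illustrated numerically. The sketch assumes that the prior has the form $P(K) \propto \gamma^{K}$ and $P(G) \propto \prod_j \alpha^{Z_j}$ and evaluates its logarithm up to an additive constant; the function name, the structure representation (a list of partitions, one per feature) and the values of $\alpha$ and $\gamma$ are hypothetical.

```python
import math

def log_structure_prior(G, K, alpha, gamma):
    """Unnormalised log prior of a CSI structure.

    Assumed form: gamma**K times the product over features of alpha**Z_j,
    where Z_j = len(G[j]) is the number of subgroups of feature j.
    """
    return K * math.log(gamma) + sum(len(g_j) for g_j in G) * math.log(alpha)

# Full structure: every component has its own parameters for both features.
G_full = [[{0}, {1}, {2}], [{0}, {1}, {2}]]
# Merged structure: fewer subgroups, hence fewer parameter sets.
G_merged = [[{0, 1}, {2}], [{0, 1, 2}]]

lp_full = log_structure_prior(G_full, 3, alpha=0.5, gamma=0.5)
lp_merged = log_structure_prior(G_merged, 3, alpha=0.5, gamma=0.5)
```

With $\alpha < 1$ every merge of two subgroups raises the prior, so a greedy search that starts from the full structure accepts a merge whenever the gain in prior outweighs the loss in likelihood.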
3 Results 

We applied the CSI mixture model based clustering to the genotype and phenotype 
data separately, as well as to the fused data set. For each data set we trained models 
with 1 to 10 components and model selection was performed using the Normalized 
Entropy Criterion (NEC) (C. Biernacki (1999)). 

3.1 Genotype clustering 



Fig. 2. VG2 plot (http://pga.gs.washington.edu/VG2.html) of three clusters (C1, C2, C3) out of the 7 
component genotype clustering; the columns are grouped by gene (DAT1, DRD1, DRD2, DRD3, DRD5). 
The color code is as follows: rare homozygous is shown 
in dark grey, heterozygous in medium grey, common homozygous in light grey and 
missing values in white. It can be seen that the clustering captures strong and distinctive 
patterns within the genotypes. The plot for the full clustering can be obtained from 
http://algorithmics.molgen.mpg.de/pymix/genoclust.html. 



The model selection on the genotype data set indicated 7 components to be 
optimal. Three example clusters out of this clustering of the genotypes are visualized 
in Fig. 2. The plot for the full clustering is available from our homepage at 
http://algorithmics.molgen.mpg.de/pymix/genoclust.html. In the figure the rare 
homozygous alleles are shown in dark grey, the heterozygous alleles in 
medium grey, the common homozygous alleles in light grey and missing values in 
white. It can be seen that the clustering recovered strong and distinctive patterns 
within the genotype data. When contrasting the clustering with the linkage disequilibrium 
(LD) found between the loci in the data set, one can see a strong agreement 
between high LD loci and loci which are informative for cluster discrimination 
according to the CSI structure. An interesting observation was that out of the 92 features 
71 were found to be uninformative in the CSI structure. In other words, only 
features that carried strong discriminative information with respect to the clustering 
influenced the result. 
3.2 Phenotype clustering 

For the phenotype data the NEC model selection indicated two and four components 
to be good choices, with the score for two being slightly better. The clusters for the 
two component model could readily be identified as a high performance and a low 
performance cluster with respect to the IQ (BD, VOC) and achievement (READING, 
MATH, SPELLING) features. In fact, the diagnosis features did not contribute 
strongly to the clustering and most were selected to be uninformative in the CSI 
structure. When considering the four component clustering a more interesting picture 
arose. The distinctive features of the four clusters can be summarized as 

1. high scores (IQ and achievement), high prevalence of ODD, above average gen- 
eral anxiety, slight increase in prevalence for many other disorders, 

2. above average scores, high prevalence of transient and chronic tics, 

3. low performance, little comorbidity, 

4. high performance, little comorbidity. 




Fig. 3. CSI structure matrix for the four component phenotype clustering. Identical colors 
within each column denote shared use of parameters. Uninformative features are depicted in 
white. 



The CSI structure matrix for this clustering is shown in Fig. 3. Identical colors 
within each column of the matrix denote a shared set of parameters. For instance 
one can see that cluster 1 has a unique set of parameters for the features Oppositional 
Defiant Disorder (ODD) and general anxiety (GENANX) while the other clusters 
share parameters. This indicates that these two features are distinguishing the cluster 
from the rest of the data set. The same is true for the transient (TIC-TRAN) and 
chronic tics (TIC-CHRON) features in cluster 2. Moreover one can immediately see 
that cluster 3 is characterized by distinct parameters for the IQ and achievement 
features. Finally, one can also consider which features are discriminating different 
clusters. For instance clusters 3 and 4 share parameters for all features but the IQ and 
achievement features. 
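
The way such a CSI structure matrix is read can be sketched in a few lines of Python. The representation used here (one list of parameter-group labels per feature, one entry per cluster), the helper names and the matrix itself are assumptions made for illustration. The matrix is hypothetical and only mirrors the patterns described in the text; it is not the fitted structure from Fig. 3.

```python
# Hypothetical CSI structure: for each feature, one parameter-group label per
# cluster (clusters 1..4). Equal labels within a row mean shared parameters.
csi = {
    "ODD":       [0, 1, 1, 1],
    "GENANX":    [0, 1, 1, 1],
    "TIC-TRAN":  [1, 0, 1, 1],
    "TIC-CHRON": [1, 0, 1, 1],
    "BD":        [0, 0, 1, 0],
    "READING":   [0, 0, 1, 0],
    "OTHER":     [0, 0, 0, 0],  # placeholder name for an uninformative feature
}


def uninformative(structure):
    """Features for which all clusters share a single parameter set."""
    return [f for f, labels in structure.items() if len(set(labels)) == 1]


def distinguishing_features(structure, cluster):
    """Features whose parameters for `cluster` (1-based) are shared with no other cluster."""
    idx = cluster - 1
    return [f for f, labels in structure.items() if labels.count(labels[idx]) == 1]


def differing_features(structure, a, b):
    """Features that discriminate cluster a from cluster b (both 1-based)."""
    return [f for f, labels in structure.items() if labels[a - 1] != labels[b - 1]]


cluster1 = distinguishing_features(csi, 1)
cluster2 = distinguishing_features(csi, 2)
three_vs_four = differing_features(csi, 3, 4)
```

Under these assumed labels, cluster 1 is set apart by ODD and GENANX, cluster 2 by the two tic features, and clusters 3 and 4 differ only in the IQ and achievement rows.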
3.3 Joined clustering 

The NEC model selection for the fused data set yielded two clusters to be optimal 
with four being second best. The analysis of the clustering showed that a small 
number of genotype features dominated the clustering and that in particular all the 
phenotype features were selected to be uninformative. Moreover one could observe 
that the genotype patterns found were more noisy and less distinctive within clusters. 
From these observations we conclude that phenotypes covered in the data set do not 
carry meaningful information about the genotypes and vice versa. 



4 Discussion 

The clustering of geno- and phenotype data separately yielded interesting partitions 
of the data. For the former the clustering captured strong patterns of LD within the 
clusters. For the latter we found sub groups of differing levels of IQ and achievement 
as well as differing degrees of comorbidity. For the fused data set the analysis re- 
vealed that there were no strong correlations between the two sources of data. While 
a positive result in this aspect would have been more interesting, the analysis was 
exploratory in nature. In particular, while the dopamine pathway is known to be relevant 
for ADHD, there was no guarantee that the specific genotypes in the data would 
account for any of the represented phenotypes. As for the CSI mixture method, we 
showed that it is well suited for the analysis of complex biological data sets. The 
interpretation of the CSI matrix as a high level overview of the discriminative information 
of each feature allows for an effortless assessment of which features are of 
relevance to specifically characterize a cluster. This greatly facilitates the analysis of 
a clustering result for data sets with a large number of features. 
Abstract. The so-called noise-component has been introduced by Banfield and Raftery 
(1993) to improve the robustness of cluster analysis based on the normal mixture model. 
The idea is to add a uniform distribution over the convex hull of the data as an additional 
mixture component. While this yields good results in many practical applications, there are 
some problems with the original proposal: 1) As shown by Hennig (2004), the method is not 
breakdown-robust. 2) The original approach doesn’t define a proper ML estimator, and doesn’t 
have satisfactory asymptotic properties. 

We discuss two alternatives. The first one consists of replacing the uniform distribution 
by a fixed constant, modelling an improper uniform distribution that doesn’t depend on the 
data. This can be proven to be more robust, though the choice of the involved tuning constant 
is tricky. The second alternative is to approximate the ML-estimator of a mixture of normals 
with a uniform distribution more precisely than it is done by the “convex hull” approach. The 
approaches are compared by simulations and for a real data example. 



1 Introduction 

Maximum Likelihood (ML)-estimation of a mixture of normal distributions is a 
widely used technique for cluster analysis (see, e.g., Fraley and Raftery (1998)). 
Banfield and Raftery (1993) introduced the term “model-based cluster analysis” for 
such methods. 

In the present paper we are concerned with an idea for improving the robustness 
of these estimators against outliers and points not belonging to any cluster. For the 
sake of simplicity, we only deal with one-dimensional data here, but the theoretical 
results carry over easily to multivariate models. See Section 6 for a discussion of 
computational issues in the multivariate case. 

Observations $x_1, \ldots, x_n$ are modelled as i.i.d. according to the density 

$$f_\eta(x) = \sum_{j=1}^{s} \pi_j \varphi_{a_j, \sigma_j^2}(x), \qquad (1)$$

where $\eta = (s, a_1, \ldots, a_s, \sigma_1, \ldots, \sigma_s, \pi_1, \ldots, \pi_s)$ is the parameter vector, the number 
of components $s \in \mathbb{N}$ may be known or unknown, $(a_j, \sigma_j)$ pairwise distinct, $a_j \in \mathbb{R}$, 
$\sigma_j > 0$, $\pi_j > 0$, $j = 1, \ldots, s$, $\sum_{j=1}^{s} \pi_j = 1$, and $\varphi_{a, \sigma^2}$ is the density of the normal 
distribution with mean $a$ and variance $\sigma^2$. Estimators of the parameters are denoted 
by hats. 

There is a problem with the ML-estimation of $\eta$. If $\hat{a}_j = x_i$ for some $i$ and some mixture 
component $j$, and $\hat{\sigma}_j \to 0$, the likelihood converges to infinity and the ML-estimator 
is not properly defined. This has to be prevented by a restriction. Either $\sigma_j \ge c_0 > 0 \ \forall j$ for 
a given $c_0$ or 

$$\frac{\sigma_i}{\sigma_j} \ge c_0 > 0, \quad i, j = 1, \ldots, s, \qquad (2)$$

ensure a well-defined ML-estimator (up to label switching of the components). In 
the present paper we use (2), see Hathaway (1985) for theoretical background. 
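
The degeneracy described above can be made concrete with a small numerical sketch. The data, the fixed parameter values and the function names are assumptions chosen for illustration. One component mean is placed exactly on a data point and its standard deviation is shrunk, and the ratio check corresponds to restriction (2).

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def log_likelihood(data, weights, means, sds):
    """Log-likelihood of a normal mixture as in (1)."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)))
        for x in data
    )


def satisfies_ratio_restriction(sds, c0):
    """Restriction (2): every ratio sigma_i / sigma_j is at least c0."""
    return min(sds) / max(sds) >= c0


# Illustrative data; the second component mean sits exactly on the point 4.0.
data = [0.0, 0.5, 1.0, 4.0]
shrinking = [1.0, 0.1, 0.001, 1e-6]
values = [log_likelihood(data, [0.5, 0.5], [0.5, 4.0], [1.0, sd]) for sd in shrinking]
allowed = [satisfies_ratio_restriction([1.0, sd], c0=0.01) for sd in shrinking]
```

In this sketch the log-likelihood keeps growing as the second standard deviation shrinks, while the restriction with $c_0 = 0.01$ rules out the two smallest values.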

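Having estimated the parameter vector $\eta$ by ML for given $s$, the points can be 
classified by assigning them to the mixture component for which the estimated a 
posteriori probability $\hat{p}_{ij}$ that $x_i$ has been generated by the mixture component $j$ is 
maximized: 

$$\mathrm{cl}(x_i) = \arg\max_j \hat{p}_{ij}, \qquad \hat{p}_{ij} = \frac{\hat{\pi}_j \varphi_{\hat{a}_j, \hat{\sigma}_j^2}(x_i)}{\sum_{k=1}^{s} \hat{\pi}_k \varphi_{\hat{a}_k, \hat{\sigma}_k^2}(x_i)}. \qquad (3)$$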

In cluster analysis, the mixture components are interpreted as clusters, though this 
is somewhat controversial, because mixtures of more than one not well separated 
normal distributions may be unimodal and could look quite homogeneous. 
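
The classification rule (3) can be sketched as follows. The parameter values stand in for fitted estimates and are purely hypothetical, as are the function names.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posteriors(x, weights, means, sds):
    """Estimated a posteriori probabilities of (3) for a single point x."""
    parts = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(parts)
    return [p / total for p in parts]


def classify(x, weights, means, sds):
    """Index (0-based) of the component with maximal posterior probability."""
    post = posteriors(x, weights, means, sds)
    return max(range(len(post)), key=post.__getitem__)


# Hypothetical estimates for s = 2 components.
weights, means, sds = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]
labels = [classify(x, weights, means, sds) for x in (-0.3, 1.0, 4.2, 6.0)]
midpoint = posteriors(2.5, weights, means, sds)
```

A point exactly halfway between two equally weighted, equally wide components receives posterior probability 0.5 for each, which illustrates why the assignment of borderline points is soft rather than crisp.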

It is possible to estimate the number of mixture components s by the Bayesian 
Information Criterion BIC (Schwarz (1978)), which is done for example by the add- 
on package “mclust” (Fraley and Raftery (1998)) for the statistical software systems 
R and SPLUS. In the present paper we don’t treat the estimation of s. Note that 
robustness for fixed s is important as well if s is estimated, because the higher s, the 
more problematic the computation of the ML-estimator, and therefore it is important 
to have good robust solutions for small s. 

Figure 1 illustrates the behaviour of the ML-estimator for normal mixtures in 
the presence of outliers. The addition of one extreme point to a data set generated 
from a normal mixture with three mixture components has the effect that the ML 
estimator joins two of the original components and fits the outlier alone by the third 
component. Note that the solution depends on the choice of $c_0$ in (2), because the 
mixture component fitting the outlier is estimated to have minimum possible variance. 

Various approaches to deal with outliers are suggested in the literature about 
mixture models (note that all of the methods introduced below work for the data in 
Figure 1 in the sense that the outlier on the right side doesn’t affect the classification 
Fig. 1. Left side: artificial data generated from a mixture of three normals with normal mixture 
ML-fit. Right side: same data with one outlier added at 22 and ML-fit with $c_0 = 0.01$. 



of the points on the left side, provided that not too unreasonable tuning constants 
are chosen where needed). Banfield and Raftery (1993) suggested to add a uniform 
distribution over the convex hull (i.e., the range for one-dimensional data) to the 
normal mixture: 



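$$f_\zeta(x) = \sum_{j=1}^{s} \pi_j \varphi_{a_j, \sigma_j^2}(x) + \pi_0 \frac{1(x \in [x_{\min}, x_{\max}])}{x_{\max} - x_{\min}}, \qquad (4)$$

where $\sum_{j=0}^{s} \pi_j = 1$, $\pi_0 > 0$, and $x_{\max}$ and $x_{\min}$ denote the maximum and minimum of the data. 
The uniform component is called the “noise component”. The parameters $\pi_j$, $a_j$ and 
$\sigma_j$ can again be estimated by ML (“BR-noise” in the following). 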
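
A minimal sketch of classification under model (4) is given below. Label 0 denotes the noise component, and the data and parameter values are hypothetical stand-ins for fitted estimates.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def br_noise_posteriors(x, data_min, data_max, noise_weight, weights, means, sds):
    """Posterior probabilities under (4); entry 0 belongs to the noise component."""
    inside = 1.0 if data_min <= x <= data_max else 0.0
    parts = [noise_weight * inside / (data_max - data_min)]
    parts += [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(parts)
    return [p / total for p in parts]


# Illustrative data with one extreme point at 22.
data = [-0.5, 0.1, 0.4, 4.8, 5.3, 22.0]
low, high = min(data), max(data)
labels = []
for x in data:
    post = br_noise_posteriors(x, low, high, 0.1, [0.45, 0.45], [0.0, 5.0], [1.0, 1.0])
    labels.append(max(range(len(post)), key=post.__getitem__))
```

With these assumed values the extreme point is assigned to the noise component while the remaining points keep their cluster labels. Note that the noise density depends on the data through the range, which is the property discussed in Section 2.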

As an alternative, McLachlan and Peel (2000) suggest to replace the normal densities 
in (1) by the location/scale family defined by $t_\nu$-distributions ($\nu$ could be fixed 
or estimated). Other families of distributions yielding more robust ML-estimators 
than the normal could be chosen as well, such as Huber’s least favourable distribu- 
tions as suggested for mixtures by Campbell (1984). 

A further idea is to optimize the log-likelihood of (1) for a trimmed set of points, 
as has already been proposed for the k-means clustering criterion (Cuesta-Albertos, 
Gordaliza and Matran (1997)). 

Conceptually, the noise component approach is very appealing. t-mixtures formally 
assign all outliers to mixture components modelling clusters. This is not appropriate 
in most situations from a subject-matter perspective, because the idea of an 
outlier is that it is essentially different from the main bulk of the data, which in the 
mixture setup means that it doesn’t belong to any cluster. McLachlan and Peel (2000) 
are aware of this and suggest to classify points in the tail areas of the t-distributions 
as not belonging to the clusters, but mathematically the outliers are still treated as 
generated by the mixture components modelling the clusters. 
Fig. 2. Left side: votes in percent for the Republican candidate in the 50 states of the USA 1968. Right 
side: fit by mixture of two (thick line) and three (thin line) normals. The symbols indicate the 
classification by two normals. 

Fig. 3. Left side: votes data fitted by a mixture of two $t_3$-distributions. Right side: fit by mixture 
of two normals and BR-noise. The symbols indicate the classifications. 



On the other hand, the trimming approach makes a crisp distinction between 
trimmed outliers and “normal” non-outliers, while in reality it is often unclear 
whether points on the borderline of clusters should be classified as outliers or members 
of the clusters. The smoother mixture approach via estimated a posteriori probabilities 
by analogy to (3) applied to (4) seems to be more appropriate in such situations, 
while still implying a conceptual distinction between normal clusters and the 
outlier generating uniform distribution. 

As an illustration, consider the dataset shown on the left side of Figure 2 giving 
the votes in percent for the Republican candidate in the 1968 election in the USA 
(taken from the add-on package “cluster” for R). The main bulk of the data can be 
roughly separated into two normal-looking clusters and there are several states on 
the left that look atypical. However, it is not so clear where the main bulk ends and 
states begin to be “outlying”, neither is it clear whether the state with the best result 
for the Republican candidate should be considered an outlier. On the right side you 
see ML-fits by normal mixtures. For s = 2 (thick line), one mixture component is 
taken to fit just three outliers on the left, obscuring the fact that two normals would 
yield a much more convincing fit for the vast majority of the higher election results. 
The mixture of three normals (thin line) does a much better job, although it joins 
several points on the left as a third “cluster” that don’t have very much in common 
and don’t look very “normal”. 

The $t_3$-mixture ML runs into problems on this dataset. For s = 2, it yields a 
spurious mixture component fitting just four packed points (Figure 3, left side). According 
to the BIC, this solution is better than the one with s = 3, which is similar 
to the normal mixture with s = 3. On the right side of Figure 3 the fit with the 
noise component approach can be seen, which is similar to three normals in terms of 
point classification, but provides a useful distinction between normal “clusters” and 
uniform “outliers”. 

Another conceptual remark concerns the interpretation of the results. It makes 
a crucial difference whether a mixture is fitted for the sake of density estimation or 
for the sake of clustering. If the main interest is in cluster analysis, it is of major 
importance to interpret the classification and the distinction between “cluster” and 
“outlier” can be very useful. In such a situation the uniform distribution for the noise 
component is not chosen because we really believe that the outliers are uniformly 
distributed, but to mimic the situation that there is no prior information where outliers 
could be and what could be their distributional shape. The uniform distribution can 
then be interpreted as “informationless” in a subjective Bayesian fashion. 

However, if the main interest is density estimation, it is much more important to 
come up with an estimator with a reasonable shape of the density. The discontinuities 
of the uniform may then be judged as unsatisfactory and a mixture of three or even 
four normals may be preferred. In the present paper we focus on the cluster analytical 
interpretation. 

In Section 2, some theoretical shortcomings of the original noise component approach 
are highlighted and two alternatives are proposed, namely replacing the uniform 
distribution over the range of the data by an improper uniform distribution and 
estimating the range of the uniform component by ML. 

In Section 3, theoretical properties of the different noise component approaches 
are discussed. In Section 4, the computation of the estimators using the EM-algorithm 
is treated and some simulation results are given in Section 5. The paper is concluded 
in Section 6. Note that the theory and simulations in this paper are an overview of 
more detailed results in Pietro Coretto’s forthcoming PhD thesis. Proofs and detailed 
simulation results will be published elsewhere. 
2 Two variations on the noise component 

2.1 The improper noise component 

Hennig (2004) has derived a robustness theory for mixture estimators based on the finite 
sample addition breakdown point by Donoho and Huber (1983). This breakdown 
point is defined, in general, as the smallest proportion of points that has to be added 
to a dataset in order to make the estimation arbitrarily bad, which is usually defined 
by at least one estimated parameter converging to infinity under a sequence of a fixed 
number of added points. In the mixture setup, Hennig (2004) defined breakdown as 
$|a_j| \to \infty$, $\sigma_j \to \infty$, or $\pi_j \to 0$ for at least one of $j = 1, \ldots, s$. Under (4), the uniform 
component is not regarded as interesting on its own, but as a helpful device, and 
its parameters are not included in the breakdown point definition. However, Hennig 
(2004) showed that for fixed s the breakdown point not only for the normal mixture-ML, 
but also for the t-mixture-ML and BR-noise is the smallest possible; all these 
methods can be driven to breakdown by adding a single data point. Note, however, 
that a point has to be a very extreme outlier for the noise component and t-mixtures to 
cause trouble, while it’s much easier to drive conventional normal mixtures to breakdown. 

The main robustness problem with the noise component is that the range of the 
uniform distribution is determined by the most extreme points, and therefore it de- 
pends strongly on where the outliers are. 

A better breakdown behaviour (under some conditions on the dataset, i.e., the 
components have to be well separated in some sense) has been shown by Hennig 
(2004) for a variant in which the noise component is replaced by an improper uniform 
density k over the whole real line: 



$$f_\zeta(x) = \sum_{j=1}^{s} \pi_j \varphi_{a_j, \sigma_j^2}(x) + \pi_0 k. \qquad (5)$$

k has to be chosen in advance, and the other parameters can then be fitted by “pseudo 
ML” (“pseudo” because (5) does not define a proper density and therefore not a 
proper likelihood). There are several possibilities to determine k: 

• a priori by subject matter considerations, deciding about the maximum density 
value for which points cannot be considered anymore to lie in a “cluster”, 

• exploratory, by trying several values and choosing the one yielding the most con- 
vincing solution, 

• estimating k from the data. This is a difficult task, because k is not defined by a 
proper probability model. Interpreting the improper noise as a technical device to 
fit a good normal mixture for most points, we propose the following technique: 

1. Fit (5) for several values of k. 

2. For every k, perform classification according to (3) and remove all points 
classified as noise. 

3. Fit a simple normal mixture on the remaining (non-noise) points. 







4. Choose the k that minimizes the Kolmogorov distance between the empirical 
distribution of the non-noise points and the fit in step 3. Note that this only 
works if all candidate values for k are small enough that a certain minimum 
portion of the data points (50%, say) is classified as non-noise. 

From a statistical point of view, estimating k is certainly most attractive, but theo- 
retically it is difficult to analyze. Particularly, it requires a new robustness theory 
because the results of Hennig (2004) assume that k is chosen independently of 
the data. The result for the voting data is shown on the left side of Figure 4. k 
is lower than for BR-noise, so that the “borderline points” contribute more to 
the estimation of the normal mixture. The classification is the same. More im- 
provement could be seen if there was a further much more extreme outlier in the 
dataset, for example a negative number caused by a typo. This would affect the 
range of the data strongly, but the improper noise approach would still yield the 
same classification. Some alternative techniques to estimate k are discussed in 
Coretto and Hennig (2007). 
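
Step 4 of the technique above can be sketched as follows, assuming that steps 1 to 3 have already produced, for every candidate k, the non-noise points and a normal mixture fitted to them. The function names, the dictionary layout of `fits` and the example values are assumptions made for illustration, not part of the original proposal.

```python
import math


def normal_cdf(x, mean, sd):
    """Distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def mixture_cdf(x, weights, means, sds):
    """Distribution function of a normal mixture."""
    return sum(w * normal_cdf(x, m, s) for w, m, s in zip(weights, means, sds))


def kolmogorov_distance(points, cdf):
    """Largest gap between the empirical distribution function of points and cdf."""
    xs = sorted(points)
    n = len(xs)
    dist = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical distribution function jumps from i/n to (i+1)/n at x.
        dist = max(dist, abs((i + 1) / n - f), abs(i / n - f))
    return dist


def choose_k(fits, n_total, min_fraction=0.5):
    """fits maps k to (non_noise_points, weights, means, sds) from steps 1 to 3."""
    best_k, best_dist = None, float("inf")
    for k, (points, weights, means, sds) in fits.items():
        if len(points) < min_fraction * n_total:
            continue  # too many points were classified as noise for this k
        d = kolmogorov_distance(points, lambda x: mixture_cdf(x, weights, means, sds))
        if d < best_dist:
            best_k, best_dist = k, d
    return best_k


# Hypothetical results for three candidate values of k on a data set of 5 points.
pts = [-1.0, -0.3, 0.0, 0.4, 1.1]
fits = {
    0.01: (pts, [1.0], [0.0], [1.0]),
    0.05: (pts, [1.0], [3.0], [1.0]),
    0.20: ([0.0], [1.0], [0.0], [1.0]),
}
selected = choose_k(fits, n_total=5)
```

The `min_fraction` argument reflects the requirement that a minimum portion of the data (50%, say) remains classified as non-noise.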



2.2 Maximum likelihood with uniform 



A further problem of BR-noise is that the model (4) is data dependent, and its ML 
estimator is not ML for any data independent model, particularly not for the following 
one: 

$$f_\zeta(x) = \sum_{j=1}^{s} \pi_j \varphi_{a_j, \sigma_j^2}(x) + \pi_0 u_{b_1, b_2}(x), \qquad (6)$$

where $u_{b_1, b_2}$ is the density of a uniform distribution on the interval $[b_1, b_2]$. This 
may come as a surprise, because the range of the data is ML for a single uniform 
distribution, but if it is mixed with some normals, the range of the data is not ML 
anymore for $b_1$ and $b_2$, because the density (6) is nonzero outside $[b_1, b_2]$. For example, 
BR-noise doesn’t deliver the ML solution for the voting data, which is shown on the 
right side of Figure 4. In order to prevent the likelihood from converging to infinity 
for $b_2 - b_1 \to 0$, the restriction (2) has to be extended to $\sigma_0 = (b_2 - b_1)/\sqrt{12}$, the standard 
deviation of the uniform. 

Taking the ML-estimator for (6) is an obvious alternative (“ML-uniform”). For 
the voting data the ML solution to fit the uniform component only on the left side 
seems reasonable. The largest election result is now assigned to one of the normal 
clusters, to the center of which it is much closer than the outliers on the left to the 
other normal cluster. 
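
That the range of the data need not maximize the likelihood of (6) can be checked numerically. The sketch below holds all other parameters fixed at hypothetical values and compares two choices of $[b_1, b_2]$ on made-up data. It is not a full ML fit, only an illustration of the argument.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def uniform_pdf(x, b1, b2):
    """Density of the uniform distribution on [b1, b2]."""
    return 1.0 / (b2 - b1) if b1 <= x <= b2 else 0.0


def uniform_sd(b1, b2):
    """Standard deviation of the uniform distribution on [b1, b2]."""
    return (b2 - b1) / math.sqrt(12.0)


def log_likelihood(data, noise_weight, b1, b2, weights, means, sds):
    """Log-likelihood of model (6)."""
    total = 0.0
    for x in data:
        dens = noise_weight * uniform_pdf(x, b1, b2)
        dens += sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))
        total += math.log(dens)
    return total


# Made-up data: four scattered points on the left, one normal-looking cluster.
data = [0.0, 1.0, 2.0, 3.0, 8.5, 9.2, 9.6, 10.0, 10.3, 10.9, 11.6, 12.0]
params = dict(noise_weight=0.2, weights=[0.8], means=[10.0], sds=[1.0])
ll_range = log_likelihood(data, b1=min(data), b2=max(data), **params)
ll_left = log_likelihood(data, b1=0.0, b2=3.0, **params)
```

In this example the uniform restricted to the left-hand points gives a higher log-likelihood than the uniform over the full range, in line with the observation made for the voting data.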



3 Some theory 

Here is a very rough overview on some theoretical results which will be published 
elsewhere in detail: 
Fig. 4. Left side: votes data fitted by (5) with s = 2 and estimated k. Right side: fit by ML for 
(6), s = 2. The symbols indicate the classifications. 



Identifiability. All parameters in model (6) are identifiable. This is not surprising 
because the uniform can be located by the discontinuities in the density (defined 
as the derivative of the cdf), and mixtures of normals are identifiable. The result 
involves a new definition of identifiability for mixtures of different families of 
distributions, see Coretto and Hennig (2006). 

Asymptotics. Note that the results below concern parameters, but asymptotic re- 
sults concerning classification can be derived in a straightforward way from the 
asymptotic behaviour of the parameter estimators. 

BR-noise. For $n \to \infty$, $1/(x_{\max} - x_{\min}) \to 0$ whenever $s > 0$. This means that 
asymptotically the uniform density is estimated to be zero (no points are 
classified as noise), even if the true underlying model is (6) including a uni- 
form. 

ML-uniform. This is consistent for model (6) under (2) including the standard 
deviation of the uniform. However, at least the estimation of $b_1$ and $b_2$ is 
not asymptotically normal because the uniform distribution doesn’t fulfill 
the conditions for asymptotic normality of ML-estimators. 

Improper noise. Unfortunately, even if the density value of the uniform distri- 
bution in (6) is known to be k, the improper noise approach doesn’t deliver 
a consistent estimate for the normal parameters in (6). Its asymptotics con- 
cerning the canonical parameters estimated by (5), i.e., the value of its “pop- 
ulation version”, is currently investigated. 

Robustness. Unfortunately, ML-uniform is not robust according to the breakdown 
definition given by Hennig (2004). It can be driven to breakdown by two extreme 
points in the same way BR-noise can be driven to breakdown by one extreme 
point, because if two outliers are added on both sides of the original dataset, 
BR-noise becomes ML for (6). 
The improper noise approach with estimated k is robust against the addition 
of extreme outliers under a sensible initial range of k. Its precise robustness 
properties still have to be investigated. 
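
The asymptotic statement about BR-noise above can be illustrated by a small simulation. The mixture used, the sample sizes and the seed are arbitrary choices made for this sketch.

```python
import random


def br_noise_density(sample):
    """Density value 1 / (x_max - x_min) of the BR-noise uniform component."""
    return 1.0 / (max(sample) - min(sample))


rng = random.Random(1)
densities = {}
for n in (100, 10_000, 200_000):
    # Equal-weight mixture of N(0, 1) and N(5, 1); no uniform component at all.
    sample = [
        rng.gauss(0.0, 1.0) if rng.random() < 0.5 else rng.gauss(5.0, 1.0)
        for _ in range(n)
    ]
    densities[n] = br_noise_density(sample)
```

Because the range of a normal sample grows only slowly with n, the decay of the noise density towards zero is slow as well, but it is visible across these sample sizes.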



4 The EM-algorithm 

Nowadays, the ML-estimator for mixtures is often computed by the EM-algorithm, 
which is shown in various settings to increase the likelihood in every iteration, see 
Redner and Walker (1984). The principle is as follows: 

Start with some initial parameter values which may be obtained by an initial partition 
of the data. Then iterate the E-step and the M-step until convergence. 

E-step: compute the posterior probabilities (3), or their analogues for the model under 
study, given the current parameter values. 

M-step: compute component-wise ML-estimators for the parameters from weighted 
data, where the weights are given by the E-step. 

For given k, the improper noise estimator can be computed precisely in the same 
way. The proof in Redner and Walker (1984) carries over even though the estimator 
is only pseudo-ML, because given the data, the improper noise component can be 
replaced by a proper uniform distribution over some set containing all data points 
with a density value of k. 
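
For fixed k, the iteration can be sketched as below. This is a minimal illustration under several assumptions: the function name and defaults are hypothetical, the clamping of the standard deviations is only a crude stand-in for a properly constrained M-step under (2), and degenerate cases (for example all densities underflowing to zero) are not handled.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of the normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_improper_noise(data, means, sds, k, noise_weight=0.1, c0=0.01, iterations=100):
    """Pseudo-ML fit of model (5) by EM for a fixed improper density value k."""
    means, sds = list(means), list(sds)
    s, n = len(means), len(data)
    weights = [(1.0 - noise_weight) / s] * s
    post = []
    for _ in range(iterations):
        # E-step: posterior probabilities; entry 0 is the improper noise component.
        post = []
        for x in data:
            parts = [noise_weight * k]
            parts += [w * normal_pdf(x, m, sd) for w, m, sd in zip(weights, means, sds)]
            total = sum(parts)
            post.append([p / total for p in parts])
        # M-step: weighted proportions, means and standard deviations.
        noise_weight = sum(p[0] for p in post) / n
        for j in range(s):
            mass = sum(p[j + 1] for p in post)
            if mass < 1e-12:  # component has lost all points; leave it unchanged
                continue
            weights[j] = mass / n
            means[j] = sum(p[j + 1] * x for p, x in zip(post, data)) / mass
            var = sum(p[j + 1] * (x - means[j]) ** 2 for p, x in zip(post, data)) / mass
            sds[j] = math.sqrt(var)
        # Crude stand-in for restriction (2): keep sigma_i / sigma_j >= c0.
        floor = c0 * max(sds)
        sds = [max(sd, floor) for sd in sds]
    return noise_weight, weights, means, sds, post


# Simulated example: two normal clusters plus two extreme points.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(50)]
data += [rng.gauss(6.0, 1.0) for _ in range(50)]
data += [30.0, -25.0]
pi0, weights, means, sds, post = em_improper_noise(
    data, means=[-1.0, 7.0], sds=[1.5, 1.5], k=0.01
)
```

The returned posteriors stem from the last E-step. Since the noise term `noise_weight * k` does not depend on the location of x, moving the two extreme points even further away would leave their classification as noise unchanged.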

For ML-uniform it has to be taken into account that the ML-estimator for a single 
uniform distribution is always the range of the data. This means for the EM-algorithm 
that whatever initial interval $I$ is chosen for $[b_1, b_2]$, the uniform mixture component 
is estimated as the uniform over the range of the data contained in $I$ in the M-step. 
Particularly, if $I = [x_{\min}, x_{\max}]$, the EM-estimator yields Banfield and Raftery’s noise 
component as ML-estimator, which is indeed a local optimum of the likelihood in 
this sense. Therefore, unfortunately, the EM-algorithm is not informative about the 
parameters of the uniform. 

A reasonable approximation of ML-uniform can only be obtained by starting 
the EM-algorithm several times, either initializing the uniform by all pairs of data 
points, or, if this is computationally not feasible, by choosing an initial grid of data 
points from which all pairs of points are used. This could be for example $x_{\min}$, $x_{\max}$, 
and all empirical $0.1q$-quantiles for $q = 1, \ldots, 9$, or the range of the data could be 
partitioned into a number of equally long intervals and the data points closest to the 
interval borders could be chosen. The solution maximizing the likelihood can then 
be taken. 
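
The quantile-based grid of starting intervals can be sketched as follows. The empirical quantile is taken here as an order statistic (the smallest data point whose empirical distribution function value reaches the level), which is one common convention and an assumption of this sketch. The function names are hypothetical.

```python
from itertools import combinations


def quantile_grid(data):
    """x_min, x_max and the empirical 0.1q-quantiles (q = 1, ..., 9) as data points."""
    xs = sorted(data)
    n = len(xs)
    grid = {xs[0], xs[-1]}
    for q in range(1, 10):
        # ceil(q * n / 10) - 1, computed in integers to avoid rounding problems
        index = (q * n + 9) // 10 - 1
        grid.add(xs[index])
    return sorted(grid)


def starting_intervals(data):
    """All pairs (b1, b2) with b1 < b2 from the grid, one EM start per pair."""
    return list(combinations(quantile_grid(data), 2))


intervals = starting_intervals(range(1, 101))
```

With 11 distinct grid points this gives 55 starting intervals, compared with 4950 intervals if all pairs of 100 distinct data points were used.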



5 Simulations 

Simulations have been carried out to compare the two new proposals ML-uniform 
and improper noise with BR-noise and ML for $t_\nu$-mixtures. The latter has been carried 
out with estimated degrees of freedom $\nu$ and classification of points as “outliers/noise” 
in the tail areas of the estimated t-components, according to Chapter 7 
of McLachlan and Peel (2000). The ML-uniform has been computed based on a grid 
of points as explained in Section 4. 

Data sets have been generated with n = 50, n = 200 and n = 500, and several 
statistics have been recorded. The precise simulation results will be published else- 
where. In the present paper we focus on the average misclassification percentages 
for the datasets with n = 200. Data have been simulated from four different param- 
eter choices of the model (6), which are illustrated in Figure 5. For every model, 70 
repetitions have been run. 



Fig. 5. Simulated models (panels: “Two outliers”, “Wide noise”, “Noise on one side”, “Noise in 
between”; horizontal axis: x). Note that for the model “2 outliers” the number of points drawn 
from the uniform component has been fixed to 2. 



The misclassification results are given in Table 1. BR-noise yielded the best performance 
for the “wide noise” model. This is not surprising, because in this model 
it’s very likely that the most extreme points on both sides are generated by the uniform. 
With two extreme outliers on one side, it was also optimal. However, it performed 
much worse in the two models that generated 10% noise at particular places 
(“noise on one side” and “noise in between”). The improper noise approach generally 
performed very well, almost always better than uniform-ML (which was the 
best method for two of the models for n = 500). The t-mixtures-ML didn’t perform 
very well, but this is at least partly due to the fact that all simulated models were 
of the “normal mixture plus uniform”-type. We will also carry out simulations from 
t-mixtures in the future. 

Table 1. Average misclassification percentages for n = 200 

Model/method         BR-noise      t-mixture   improper noise   ML-uniform 
Two outliers            2.7           7.3           3.9             3.3 
Wide noise              8.0           9.6           8.4             9.3 
Noise on one side      10.6           8.3           3.6             5.3 
Noise in between    [illegible]       8.7           5.5             7.3 



6 Conclusion 

To deal with noise and outliers in cluster analysis, two new methods have been pro- 
posed, which are variants of Banfield and Raftery’s (1993) noise component, namely 
the use of an improper density to model the noise and an ML-estimator for a mixture 
model including a uniform component. Both methods have some theoretical advan- 
tages over BR-noise. Simulations showed a good performance particularly for the 
improper noise component with estimated density value. We find the principle to 
model outliers and noise by an additional (proper or improper) uniform component 
appealing, particularly for cluster analysis applications. It allows a smooth classifi- 
cation of points as “noise” or as belonging to a cluster. 

Of course it is desirable to apply the ideas to multivariate data as well. This is 
possible in a straightforward way for the improper noise approach where k is fixed 
in advance by subject matter considerations. Our proposal to estimate k may work as 
well for moderate dimensionality, but this is still under investigation. 

The ML-uniform approach is problematic in the multivariate setup because of 
the large number of potentially reasonable support sets for the uniform distribution. 
In principle it could be applied by assuming the support of the uniform component 
as rectangular and parallel to the coordinate axes defined by the variables in the data. 
The ML solution could then be approximated by the best of several hyperrectangles 
defined by pairs of data points. It remains to be seen whether this leads to useful 
clusterings. 
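
How candidate support sets could be generated from pairs of data points is sketched below. This only illustrates the idea stated above; the function names are hypothetical, and the handling of degenerate boxes (dropping them) is an assumption of the sketch.

```python
from itertools import combinations


def rectangle_from_pair(p, q):
    """Axis-parallel hyperrectangle spanned by two points, as (lower, upper) corners."""
    lower = [min(a, b) for a, b in zip(p, q)]
    upper = [max(a, b) for a, b in zip(p, q)]
    return lower, upper


def volume(lower, upper):
    """Volume of the hyperrectangle."""
    v = 1.0
    for lo, hi in zip(lower, upper):
        v *= hi - lo
    return v


def uniform_density(x, lower, upper):
    """Density at x of the uniform distribution on the hyperrectangle."""
    inside = all(lo <= xi <= hi for xi, lo, hi in zip(x, lower, upper))
    return 1.0 / volume(lower, upper) if inside else 0.0


def candidate_rectangles(points):
    """One candidate support per pair of points; zero-volume boxes are skipped."""
    rects = []
    for p, q in combinations(points, 2):
        lower, upper = rectangle_from_pair(p, q)
        if volume(lower, upper) > 0.0:
            rects.append((lower, upper))
    return rects


points = [(0.0, 0.0), (2.0, 1.0), (2.0, 3.0), (4.0, 4.0)]
rects = candidate_rectangles(points)
```

The number of candidates grows quadratically in the number of points used, so in practice a reduced grid of points would presumably be needed, as in the one-dimensional case.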
Abstract. An approach for the integration of supervising information into unsupervised clustering 
is presented (semi-supervised learning). The underlying unsupervised clustering algorithm 
is based on swarm technologies from the field of Artificial Life systems. Its basic 
elements are autonomous agents called Databots. Their unsupervised movement patterns cor- 
respond to structural features of a high dimensional data set. Supervising information can be 
easily incorporated in such a system through the implementation of special movement strate- 
gies. These strategies realize given constraints or cluster information. The system has been 
tested on fundamental clustering problems. It outperforms constrained k-means. 



1 Introduction 

For traditional cluster analysis there is usually a large supply of unlabeled data but 
little background information about classes. To generate a complete labeling of 
data can be expensive. Instead, background information might be available as a small 
amount of preclassified input samples that can help to guide the cluster analysis. Consequently, 
integration of background information into clustering and classification 
techniques has recently become a focus of interest. See Zhu (2006) for an overview. 

Retrieval of previously unknown cluster structures, in the sense of multi-mode 
densities, from unclassified and classified data is called semi-supervised clustering. 
In contrast to semi-supervised classification, semi-supervised clustering methods are 
not limited to the class labels given in the preclassified input samples. New classes 
might be discovered, and given classes might be merged or purged. 

A particularly promising approach to unsupervised cluster analysis is the use of systems 
that possess the ability of emergence through self-organization (Ultsch (2007)). This 
means that systems consisting of a huge number of interacting entities may pro- 
duce a new, observable pattern on a higher level. Such patterns are said to emerge 
from the self-organizing entities. A biological example for emergence through self- 
organization is the formation of swarms, e.g. bee swarms or ant colonies. 

An example of such nature-inspired information processing techniques is clus- 
tering with simulated ants. The ACLUSTER system of Ramos and Abraham (2003) 
is inspired by ant colonies clustering corpses. It consists of a low-dimensional grid 
that only carries pheromone intensities. A set of simulated ants moves on the grid’s 
nodes. The ants are used to cluster data objects that are located on the grid. An ant 
might pick up a data object and drop it later on. Ants are more likely to drop an 
object on a node whose neighbourhood has similar data objects rather than on nodes 
with dissimilar objects. Ants move according to pheromone trails on the grid. 

In this paper we describe a novel approach for semi-supervised clustering that 
is based on our unsupervised learning artificial life system (see Ultsch (2000)). The 
main idea is that a large number of autonomous agents show collective behaviour 
patterns that correspond to structural features of a high dimensional training set. This 
approach turns out to be inherently prepared to incorporate additional information 
from partially labeled data. 



2 Artificial life 

The artificial life system (ALife) is used to cluster a finite high-dimensional training 
set $X \subset \mathbb{R}^n$. It consists of a low-dimensional grid $I$ and a set $B$ of so-called 
Databots. A Databot carries an input sample of training set $X$ and moves on the 
grid. Formally, a Databot in $B$ is denoted as a triple $(x_i, m(x_i), S_i)$, where $x_i \in X$ 
is the input sample, $m(x_i) \in I$ is the Databot’s location on the grid and $S_i$ is a 
set of movement programs, so-called strategies. Later on, mapping of data onto the 
low-dimensional grid is used for visualization of distance and density structure as 
described in section 4. 

A strategy $s \in S_i$ is a function that assigns probabilities to the available directions
of movement (north, east, et cetera). The Databot's new location $m'(x_i)$ is chosen at
random according to the strategies' probabilities. Several strategies are combined into
a single one by weighted averaging of probabilities. Probabilities of movements are
to be chosen such that a Databot is more likely to move towards Databots carrying
similar input samples than towards Databots with dissimilar input samples. This aims
at the creation of a sufficiently topography-preserving projection $m : X \to I$ (see
figure 1). For an overview of strategies see Ultsch (2000).



Fig. 1. ALife system: Databots carry high-dimensional data objects while moving on the grid,
nearby objects are to be mapped on nearby nodes of the low-dimensional grid

A generalized view on strategies for topography preservation is given below. For
each Databot $(x_i, m(x_i), S_i) \in B$ there is a set of bots $F_i$ (friends) it should move
towards. Here, the strategy for topography preservation is denoted by $s_F$. Canonically,
$F_i$ is chosen to be the Databots carrying the $k \in \mathbb{N}$ most similar input samples
with respect to $x_i$ according to a given dissimilarity measure
$d : X \times X \to \mathbb{R}_0^+$, e.g. the Euclidean metric on cardinal scaled spaces.
Strategy $s_F$ assigns probabilities to all directions of movement such that $m(x_i)$ is
more likely to be moved towards the grid locations of its friends than towards any
other node on the grid. This can easily be achieved, for example, by vectorial addition
of distances for every direction of movement. Additionally, a set of Databots $F_i'$ (foes)
with the most dissimilar input samples with respect to $x_i$ might inversely be used such
that $m(x_i)$ is moved away from its foes. An illustrative example for $s_F$ is given in
figure 2. In analogy to self-organizing maps (Kohonen (1982)), the size of set $F_i$
decreases over time. This means that Databots adapt to a global ordering before they
adapt to local orderings.

Strategies are combined by weighted averaging, i.e. the probability of movement towards
direction $D \in \{\text{north}, \text{east}, \ldots\}$ is
$p(D) = \sum_{s \in S_i} w_s \, s(D) / \sum_{s \in S_i} w_s$ with $w_s \in [0,1]$
being the weight of strategy $s$. Linear combination of probabilities is to be preferred
over multiplicative combination because of its compensating behaviour. Several
combinations of strategies have been tested intensively. It turned out that a small
amount* of random walk is necessary for obtaining good results. This strategy assigns
equal probabilities to all available directions in order to overcome local optima with
the help of randomness.



Fig. 2. Strategies for Databots' movements: (a) probabilities for directed movements (b) set
of friends (black) and foes (white), counters resulting from vectorial addition of distances are
later on normalized to obtain probabilities, e.g. $p_N$ consists of black northern distances and
white southern distances
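The combination of strategies by weighted averaging can be illustrated with a minimal Python sketch. The four-direction neighbourhood, the function names and the example probabilities are assumptions made for illustration only; the weight of 10% for the random walk follows the range given in the footnote.

```python
import random

# Assumed 4-neighbourhood; the text only names "north, east, et cetera".
DIRECTIONS = ("north", "east", "south", "west")


def combine_strategies(strategies, weights):
    """Weighted average of the per-direction probabilities of several strategies."""
    total = sum(weights)
    return {
        d: sum(w * s[d] for s, w in zip(strategies, weights)) / total
        for d in DIRECTIONS
    }


def random_walk():
    """Strategy that assigns equal probability to every direction."""
    return {d: 1.0 / len(DIRECTIONS) for d in DIRECTIONS}


def choose_direction(probabilities, rng=random):
    """Draw one direction at random according to the combined probabilities."""
    dirs = list(probabilities)
    return rng.choices(dirs, weights=[probabilities[d] for d in dirs], k=1)[0]


# Hypothetical output of a topography-preserving strategy for one Databot.
s_f = {"north": 0.7, "east": 0.1, "south": 0.1, "west": 0.1}
p = combine_strategies([s_f, random_walk()], weights=[0.9, 0.1])
```

Because the combination is linear, a direction that one strategy rates as unlikely can still be chosen if another strategy supports it, which is the compensation effect mentioned above.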



3 Semi-supervised artificial life 

As described in section 2, the ALife system produces a vector projection for clustering
purposes using a movement strategy $s_F$ depending on set $F_i$. The choice of bots
in $F_i \subseteq B$ is derived from the input samples' similarities with respect to $x_i$.
This is subsumed as unsupervised constraints because $F_i$ arises from unlabeled data only.

Background information about cluster memberships is given as pairwise constraints
stating that two input samples $x_i, x_j \in X$ belong to the same class (must-link)
or to different classes (cannot-link). For each input sample $x_i$ this results in two sets:
$ML_i \subseteq X$ denotes the samples that are known to belong to the same class, whereas
$CL_i \subseteq X$ contains all samples from different classes. $ML_i$ and $CL_i$ remain
empty for unclassified input samples. For each $x_i$, the vector projection $m : X \to I$
has to reflect this by mapping $m(x_i)$ near $m(ML_i)$ and far from $m(CL_i)$. This is
subsumed as supervised constraints because they arise from preclassifications.
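As a small illustration of how the must-link and cannot-link sets follow from partially labeled data, consider the following Python sketch. The function name and the convention that a sample is not a member of its own must-link set are assumptions; unlabeled samples are marked with `None`.

```python
def pairwise_constraint_sets(labels):
    """Derive must-link and cannot-link index sets from partial labels.

    labels[i] is a class label, or None if sample i is unclassified.
    Returns two dicts mapping each sample index to a set of sample indices.
    """
    must_link, cannot_link = {}, {}
    for i, label_i in enumerate(labels):
        if label_i is None:
            # Unclassified samples carry no supervised constraints.
            must_link[i], cannot_link[i] = set(), set()
            continue
        must_link[i] = {
            j for j, label_j in enumerate(labels) if j != i and label_j == label_i
        }
        cannot_link[i] = {
            j
            for j, label_j in enumerate(labels)
            if label_j is not None and label_j != label_i
        }
    return must_link, cannot_link


labels = ["a", None, "a", "b", None]
ml, cl = pairwise_constraint_sets(labels)
```

In this toy example samples 0 and 2 are linked to each other, both are separated from sample 3, and the unlabeled samples 1 and 4 receive empty sets.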

The $s_F$ paradigm for satisfaction of unsupervised constraints and how to combine
strategies has already been described in section 2. The same method is applied for
satisfaction of supervised constraints. This means that an additional strategy $s_{ML}$ is
introduced for Databots carrying preclassified input samples. For such a Databot
$(x_i, m(x_i), S_i)$ the set of friends is simply defined as $F_i = ML_i$. According to that
strategy, $m(x_i)$ is more likely to be moved towards the grid locations $m(ML_i)$ than to
any other node on the grid. This strategy $s_{ML}$ is added to the other available
strategies. Thus, integration of supervised and unsupervised learning tasks is realized
on the basis of movement strategies for Databots creating a vector projection $m$. This is
referred to as semi-supervised learning Databots. The whole system is referred to as
semi-supervised ALife (ssALife).

There are at least two strategies that have to be combined for suitable movement
control of semi-supervised learning Databots: the $s_F$ strategy concerning
unsupervised constraints and the $s_{ML}$ strategy concerning supervised constraints. An
adequate proportional weighting of the $s_F$ and $s_{ML}$ strategies can be estimated by
several methods. Any clustering method can be understood as a classifier whose quality is
assessable as prediction accuracy. In this case, accuracy means accordance of the input
samples' preclassifications and the final clustering. The suitability of a given
proportional weighting may be evaluated by cross validation methods. Another approach
is based on two assumptions. First, cluster memberships are global rather than local
qualities. Second, the ssALife system adapts to global orderings before local ones.
Therefore, the influence of the $s_{ML}$ strategy is constantly decreasing from 100%
down to 0 over the training process. The latter method was applied in the current
realization of the ssALife system.
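A decreasing influence of the must-link strategy could, for instance, be realized by a weight schedule such as the one sketched below in Python. The linear shape of the decay, the function names and the fixed 5% share for the random walk are assumptions; the text only states that the influence decreases from 100% down to 0 over the training process.

```python
def sml_share(step, n_steps):
    """Share of the must-link strategy: 1.0 at the first step, 0.0 at the last.

    A linear decay is assumed here purely for illustration.
    """
    if n_steps < 2:
        return 0.0
    return 1.0 - step / (n_steps - 1)


def strategy_weights(step, n_steps, random_share=0.05):
    """Weights of the three strategies at a given training step (sum to one)."""
    share = sml_share(step, n_steps)
    rest = 1.0 - random_share
    return {
        "s_ML": rest * share,          # supervised constraints
        "s_F": rest * (1.0 - share),   # unsupervised constraints
        "random": random_share,        # random walk against local optima
    }


first = strategy_weights(0, 100)
last = strategy_weights(99, 100)
```

With such a schedule the preclassifications dominate the early, global ordering phase, while the final local ordering is governed by the unlabeled similarity structure alone.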



* usually with an absolute weight of 5% up to 10%



4 Semi-supervised artificial life for cluster analysis

Since ssALife is not an inherent clustering method but a vector projection method, its
visualization capabilities are enhanced using structure maps and the U-Matrix method.

A structure map enhances the regular grid of the ALife system such that each
node $i \in I$ contains a high-dimensional codebook vector $m_i \in \mathbb{R}^n$. Structure
maps are used for vector projection and quantization purposes, i.e. arbitrary input
samples $x \in \mathbb{R}^n$ are assigned to nodes with best-matching codebook vectors
$bm(x) = \arg\min_{i \in I} d(x, m_i)$, with $d$ being the dissimilarity measure from
section 2. For a meaningful projection the codebook vectors are to be arranged in a
topography-preserving manner. This means that neighbouring nodes $i, j$ usually have
codebook vectors $m_i, m_j$ that are neighbouring in the input space. A popular method to
achieve that is the Emergent Self-organizing Map (see Ultsch (2003)). In this context, the
projected input samples $m(x_i), \forall x_i \in X$, from our ssALife system are used for
structure map creation. A high-dimensional interpolation based on the self-organizing
map's learning technique determines the codebook vectors (Kohonen (1982)).
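The assignment of an input sample to its best-matching node can be sketched in a few lines of Python. The dictionary representation of the structure map, the function names and the toy codebook are assumptions for illustration; the Euclidean distance stands in for the dissimilarity measure of section 2.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def best_match(x, codebook, d=euclidean):
    """Return the grid node whose codebook vector is least dissimilar to x.

    codebook maps grid nodes (here: 2D grid coordinates) to codebook vectors.
    """
    return min(codebook, key=lambda node: d(x, codebook[node]))


# Hypothetical 2x2 structure map with two-dimensional codebook vectors.
codebook = {
    (0, 0): (0.0, 0.0),
    (0, 1): (1.0, 0.0),
    (1, 0): (0.0, 1.0),
    (1, 1): (1.0, 1.0),
}
node = best_match((0.9, 0.2), codebook)
```

Here the sample (0.9, 0.2) is assigned to node (0, 1), whose codebook vector (1.0, 0.0) is closest.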

The U-Matrix (see figure 3 for illustration) is the canonical display of structure 
maps. The local distance structure is displayed on each grid node as a height value 
creating a 3D landscape of the high dimensional data space. Clusters are represented 
as valleys whereas mountain ranges depict cluster boundaries. See Ultsch (2003) for 
an overview. 

Contrary to common belief, visualizations of structure maps are not clustering
algorithms. Segmentation of U-Matrix landscapes into clusters has to be done
separately. The U*C clustering algorithm uses an entropy-based heuristic in order to
automatically determine the correct number of clusters (Ultsch and Herrmann (2006)).
With the help of the watershed transformation, a structure map decomposes into
several coherent regions called basins. Basins are merged in order to form clusters if
they share a highly dense region on the structure map. Therefore, U*C combines
distance and density information for cluster analysis.



5 Experimental settings and results 

In order to evaluate the clustering and self-organizing abilities of ssALife, its
clustering performance was measured. The main idea is to use data sets on which the
input samples' true classification is known beforehand. Clustering accuracy can
be evaluated as the fraction of correctly classified input samples. The ssALife is tested
against the well-known constrained k-means (COPK-Means) from Wagstaff et al.
(2001). For each data set, both algorithms got 10% of the input samples with the true
classification. The remaining samples are presented as unlabeled data.
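Clustering accuracy in the above sense requires a correspondence between clusters and classes. The following Python sketch assumes that every cluster is assigned to the class held by the majority of its members; this mapping and the function name are illustrative assumptions, not necessarily the procedure used in the experiments.

```python
from collections import Counter


def clustering_accuracy(true_labels, cluster_ids):
    """Fraction of samples whose cluster's majority class equals their true class."""
    class_counts_by_cluster = {}
    for true_label, cluster in zip(true_labels, cluster_ids):
        class_counts_by_cluster.setdefault(cluster, Counter())[true_label] += 1
    # Each cluster contributes the size of its majority class.
    correct = sum(
        counts.most_common(1)[0][1] for counts in class_counts_by_cluster.values()
    )
    return correct / len(true_labels)


accuracy = clustering_accuracy([0, 0, 0, 1, 1, 1], [5, 5, 7, 7, 7, 7])
```

Note that a majority mapping tends to overestimate accuracy when the number of clusters exceeds the number of classes.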

The data comes from the fundamental clustering problem suite (FCPS). This
is a collection of data sets for testing clustering algorithms. Each data set
represents a certain problem that arbitrary clustering algorithms shall be able to
handle when facing real world data sets. For example, "Chainlink", "Atom" and
"Target" contain spatial clusters of not linearly separable, i.e. twined, structure.
"Lsun", "EngyTime" and "Wingnut" consist of density defined clusters. For details see
http://www.mathematik.uni-marburg.de/~databionics.

Comparative results can be seen in table 1. The ssALife method clearly out- 
performs COPK-Means. COPK-Means suffers from its inability to recognize more 
complex cluster shapes. As an example, the so-called EngyTime data set is shown in 
figure 3. 



Table 1. Percental clustering accuracy: ssALife outperforms COPK-Means, accuracy
estimated on fully classified original data over fifty runs with random initialization

data set        COPK-Means    ssALife with U*C
Atom               71.0           100
Chainlink          65.7           100
Hepta             100             100
Lsun               96.4           100
Target             55.2           100
Tetra             100             100
TwoDiamonds       100             100
Wingnut            93.4           100
EngyTime           90.0            96.3




Fig. 3. Density defined clustering problem EngyTime: (a) partially labeled data (b) ssALife 
produced U-Matrix, clearly visible decision boundary, fully labeled data 



6 Discussion 

In this work we described a first approach to semi-supervised cluster analysis using
autonomous agents called Databots. To our knowledge, this is the first approach that
aims for the realization of semi-supervised learning paradigms on the basis of a swarm
clustering algorithm.

The ssALife system and Ramos’ ACLUSTER differ in two ways. First, Databots 
can be seen as a bijective mapping of input samples onto locations whereas simu- 
lated ants have no permanent connection to the data. This facilitates the integration 
of additional data-related features into the swarm entities. Furthermore, there is no 
global exchange about topographic information in ACLUSTER, which may lead to 
discontinuous projections of clusters, i.e. projection errors. 

Most popular approaches for semi-supervised learning can be divided into
two groups (Belkin et al. (2006)). The manifold assumption states that input samples
with equal class labels are located on manifolds or subspaces, respectively, of the
input space (Belkin et al. (2006), Bilenko et al. (2004)). Recovery of such manifolds
is accomplished by optimization of an objective function, e.g. for adaptation of
metrics. The cluster assumption states that input samples in the same cluster are likely
to have the same class label (Wagstaff et al. (2001), Bilenko et al. (2004)). Again,
recovery of such clusters is accomplished by optimization of an objective function.
Such objective functions consist of terms for unsupervised cluster retrieval and a
loss term that punishes supervised constraint violations. Obviously, the obtainable
clustering solutions are predetermined by the inherent cluster shape assumption of
the chosen objective function. For example, k-means like clustering algorithms and
Mahalanobis like metric adaptations, too, assume linearly separable clusters of
spherical shape and well-behaved density structure. In contrast to that, the ssALife
method comes up with a simple yet powerful learning procedure based on movement
programs for autonomous agents. This enables a unification of supervised and
unsupervised learning tasks without the need for a main objective function. Except for
the used dissimilarity measure, the ssALife system does not rely on such objective
functions and reaches maximal accuracy on eight of the nine FCPS data sets tested
(see table 1).



7 Summary 

In this paper, cluster analysis is presented on the basis of a vector projection problem.
Supervised and unsupervised learning of a suitable projection means to incorporate
information from topography and preclassifications of input samples. In order to solve
this, a very simple yet powerful enhancement of our ALife system was introduced.
So-called Databots move the input samples’ projection points on a grid-shaped out- 
put space. Databots’ movements are chosen according to so-called strategies. The 
unifying framework for supervised and unsupervised learning is simply based on 
defining an additional strategy that can incorporate preclassifications into the self- 
organization process. 

From this self-organizing process a non-linear display of the data’s spatial struc- 
ture emerges. The display is used for automatic cluster analysis. The proposed 
method ssALife outperforms a simple yet popular algorithm for semi-supervised
cluster analysis.



Abstract. Euclidean partition dissimilarity $d(P, \tilde{P})$ (Dimitriadou et al., 2002) is defined
as the square root of the minimal sum of squared differences of the class membership values of
the partitions $P$ and $\tilde{P}$, with the minimum taken over all matchings between the classes of
the partitions. We first discuss some theoretical properties of this dissimilarity measure. Then,
we look at the Euclidean consensus problem for partition ensembles, i.e., the problem to find a
hard or soft partition $P$ with a given number of classes which minimizes the (possibly weighted)
sum $\sum_b w_b \, d(P_b, P)^2$ of squared Euclidean dissimilarities $d$ between $P$ and the elements
$P_b$ of the ensemble. This is an NP-hard problem, and related to consensus problems studied in
Gordon and Vichi (2001). We present an efficient “Alternating Optimization” (AO) heuristic
for finding $P$, which iterates between optimally rematching classes for fixed memberships, and
optimizing class memberships for fixed matchings. An implementation of such AO algorithms
for consensus partitions is available in the R extension package clue. We illustrate this
algorithm on two data sets (the popular Rosenberg-Kim kinship terms data and a macroeconomic
one) employed by Gordon & Vichi.



1 Introduction 

Over the years, a huge number of dissimilarity measures for (hard) partitions has 
been suggested. Day (1981), building on work by Boorman and Arabie (1972), iden- 
tifies two leading groups of such measures. Supervaluation metrics are derived from 
supervaluations on the lattice of partitions. Minimum cost flow (MCF) metrics are 
given by the minimum weighted number of admissible transformations required to 
transform one partition into another. 

One such MCF metric is the R-metric of Rubin (1967), defined as the “minimal
number of augmentations and removals of single objects” needed to transform
one partition into another. This equals twice the Boorman-Arabie $A$ (single element
moves) distance, and is also called transfer distance in Charon et al. (2006) and
partition-distance in Gusfield (2002). It can be computed by solving the Linear Sum
Assignment Problem (LSAP)

$$\min_{W \in \mathcal{W}_A} \sum_{k,l} w_{kl} \, |C_k \Delta \tilde{C}_l|$$

where $\mathcal{W}_A$ is the set of all matrices $W = [w_{kl}]$ with non-negative elements and
row and column sums all one, and the $\{C_k\}$ and $\{\tilde{C}_l\}$ denote the classes of the
first and second partition $P$ and $\tilde{P}$, respectively. The LSAP can be solved efficiently in
polynomial time using primal-dual algorithms such as the so-called Hungarian method, see e.g.
Papadimitriou and Steiglitz (1982).
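For very small partitions the assignment problem can be illustrated by a brute-force Python sketch. Note that the exhaustive search over all matchings replaces the Hungarian method here purely for brevity, so it is only practical for a handful of classes; the function names are assumptions, and the partition with fewer classes is augmented by empty classes.

```python
from itertools import permutations


def classes_of(labels):
    """Turn a label vector into a list of classes (sets of object indices)."""
    groups = {}
    for obj, label in enumerate(labels):
        groups.setdefault(label, set()).add(obj)
    return list(groups.values())


def r_metric(labels_a, labels_b):
    """Minimal total size of symmetric differences over all class matchings."""
    a, b = classes_of(labels_a), classes_of(labels_b)
    size = max(len(a), len(b))
    # Augment the smaller partition with empty classes.
    a += [set()] * (size - len(a))
    b += [set()] * (size - len(b))
    cost = [[len(c_k ^ c_l) for c_l in b] for c_k in a]
    return min(
        sum(cost[k][perm[k]] for k in range(size))
        for perm in permutations(range(size))
    )
```

For example, `r_metric([0, 0, 1, 1], [0, 0, 0, 1])` counts one removal and one augmentation of the third object, i.e. twice the single element move that transforms one partition into the other.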

For possibly soft partitions, as e.g. obtained by fuzzy or model-based mixture
clustering, the theory of dissimilarities is far less developed. To fix notations and
terminology, let $n$ be the number of objects to be classified. A (possibly soft) partition $P$
assigns to each object $i$ and class $k$ a non-negative number $m_{ik}$ quantifying the
“belongingness” or membership of the object to the class, such that $\sum_k m_{ik} = 1$. We can
gather the $m_{ik}$ into the membership matrix $M = M(P) = [m_{ik}]$ of the partition. In
general, $M$ is a stochastic matrix; for hard partitions, it is a binary matrix. Note that
$M$ is unique up to permutations of its columns. We refer to the number of non-zero
columns of $M$ as the number of classes of the partition, and write $\mathcal{P}_v$ and
$\mathcal{P}_v^H$ for the space of all (possibly soft) partitions with $v$ classes, and all hard
partitions with $v$ classes, respectively.

In what follows, it will often be convenient to bring memberships to “a com- 
mon number of classes” (i.e., columns) by adding trailing zero columns as needed. 
Formally, we can work on the space P of all stochastic matrices with n rows and in- 
finitely many columns, with the normalization that non-zero columns are the leading 
ones. 

For two hard partitions with memberships $M$ and $\tilde{M}$, we have
$|C_k \Delta \tilde{C}_l| = \sum_i |m_{ik} - \tilde{m}_{il}| = \sum_i |m_{ik} - \tilde{m}_{il}|^p$
for all $p \ge 1$, as $|u|^p = |u|$ if $u \in \{-1, 0, 1\}$. This strongly suggests
to generalize the R-metric to possibly soft partitions via dissimilarities defined as the
$p$-th root of

$$\min_{W \in \mathcal{W}_A} \sum_{k,l} w_{kl} \sum_i |m_{ik} - \tilde{m}_{il}|^p .$$

Using $p = 2$ gives Euclidean dissimilarity $d$ (Dimitriadou et al. (2002)). Identifying
the optimal assignment with its corresponding map $\pi$ (“permutation” in the possibly
augmented case) of the classes of the first to those of the second partition (i.e.,
$\pi(k) = l$ iff $w_{kl} = 1$ iff $C_k$ is matched with $\tilde{C}_l$), we can use
$\sum_{k,l} w_{kl} \sum_i |m_{ik} - \tilde{m}_{il}|^p = \sum_i \sum_k |m_{ik} - \tilde{m}_{i,\pi(k)}|^p$
to obtain

$$d(M, \tilde{M}) = \min_{\Pi} \|M - \tilde{M}\Pi\|_F$$

where the minimum is taken over all permutation matrices $\Pi$ and
$\|M\|_F = (\sum_{i,k} m_{ik}^2)^{1/2}$ is the Frobenius norm. See Hornik (2005b) for details.
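The Euclidean dissimilarity between two membership matrices can be sketched in Python as follows. As an assumption for brevity, the minimum over permutation matrices is found by enumerating all column permutations instead of solving a linear sum assignment problem, which limits the sketch to small numbers of classes; memberships are brought to a common number of columns by adding trailing zero columns, and all names are illustrative.

```python
import math
from itertools import permutations


def pad_columns(m, width):
    """Add trailing zero columns so that every row has the given width."""
    return [list(row) + [0.0] * (width - len(row)) for row in m]


def euclidean_dissimilarity(m, m_tilde):
    """Minimal Frobenius distance between m and a column-permuted m_tilde."""
    width = max(len(m[0]), len(m_tilde[0]))
    a, b = pad_columns(m, width), pad_columns(m_tilde, width)
    best = min(
        sum(
            (row_a[k] - row_b[perm[k]]) ** 2
            for row_a, row_b in zip(a, b)
            for k in range(width)
        )
        for perm in permutations(range(width))
    )
    return math.sqrt(best)


hard_a = [[1, 0], [1, 0], [0, 1], [0, 1]]
hard_b = [[1, 0], [1, 0], [1, 0], [0, 1]]
d_hard = euclidean_dissimilarity(hard_a, hard_b)
```

For the two hard partitions above the squared dissimilarity equals 2, in agreement with the sum of symmetric differences of the optimally matched classes.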

For $p = 1$, we get Manhattan dissimilarity (Hornik, 2005a). For general $p$ and
$W = [w_{kl}]$ constrained to have given row sums $\alpha_k$ and column sums $\beta_l$ (not
necessarily all identical as for the assignment case), we get the Mallows-type distances
introduced in Zhou et al. (2005), and motivated from formulations of the
Monge-Kantorovich optimal mass transfer problem.

Gordon and Vichi (2001, Model 1) introduce a dissimilarity measure also based
on squared distances between optimally matched columns of the membership matrices,
but ignoring the “unmatched” columns. This will result in discontinuities (with
respect to the natural topology on $\mathcal{P}$) for sequences of membership matrices for
which at least one column converges to zero.

In Section 2, we give some theoretical results related to Euclidean partition dis- 
similarity, and present a heuristic for solving the Euclidean consensus problem for 
partition ensembles. Section 3 investigates soft Euclidean consensus partitions for 
two data sets employed in Gordon and Vichi (2001), the popular Rosenberg-Kim 
kinship terms data and a macroeconomic one. 



2 Theory 

2.1 Maximal Euclidean dissimilarity 

Charon et al. (2006) provide closed-form expressions for the maximal R-metric 
(transfer distance) between hard partitions with v and v classes, which readily yield 

= max d{M,M) = \/n-Cmm(v,v), 

MeiPy ,MeiPd 

with the minimum concordance Cmin given in Theorem 2 of Charon et al. (2006). One 
can show (Hornik, 2007b) that the maxima of the Euclidean dissimilarity between 
(possibly soft) partitions can always be attained at the “boundary”, i.e., for hard 
partitions, such that 



max d{M,M)= max d{M,M)=^yy 
mg2’v,mg£Pv MeP^,MeP^ 

E.g., if V < V and (v — l)v < «, then /iv,v = (n — [m/v] ) Note that the dissimilar- 
ities between soft partitions are “typically” much smaller than for hard ones. 

2.2 The Euclidean consensus problem 

Aggregating ensembles of clusterings into a consensus clustering by minimiz- 
ing average dissimilarity has a long history, with key contributions including 
Mirkin (1974), Barthelemy and Monjardet (1981, 1988), and Wakabayashi (1998). 
More generally, clusterwise aggregation of ensembles of relations (thus containing 
equivalence relations, i.e., partitions, as a special case) was introduced by Gaul and
Schader (1988).

Given an ensemble (profile) of partitions $P_1, \ldots, P_B$ of the same $n$ objects and
weights $w_1, \ldots, w_B$ summing to one, a soft Euclidean consensus partition
(generalized mean partition) is defined as a partition $P$ which minimizes

$$\sum_{b=1}^{B} w_b \, d(P_b, P)^2$$

over the space $\mathcal{P}_v$ of (possibly soft) partitions with $v$ classes, for given $v$.
Similarly, a hard Euclidean consensus partition minimizes the criterion function over the
hard partitions with $v$ classes. Equivalently, one needs to find

$$\min_M \sum_b w_b \min_{\Pi_b} \|M - M_b \Pi_b\|_F^2
  = \min_M \min_{\Pi_1, \ldots, \Pi_B} \sum_b w_b \|M - M_b \Pi_b\|_F^2$$

over all suitable $M$ and permutation matrices $\Pi_1, \ldots, \Pi_B$.

Soft Euclidean consensus partitions can be characterized as follows (see
Hornik (2005b)). For fixed $\Pi_1, \ldots, \Pi_B$,
$\sum_b w_b \|M - M_b \Pi_b\|_F^2 = \|M - \bar{M}\|_F^2 + \sum_b w_b \|M_b \Pi_b\|_F^2 - \|\bar{M}\|_F^2$,
where $\bar{M} = \sum_b w_b M_b \Pi_b$ is the weighted mean of the (suitably matched)
memberships. If $\bar{M}$ is feasible for $M$ (such that $v \ge \max(v_1, \ldots, v_B)$),
the overall minimum sought is found by minimizing

$$\sum_b w_b \|M_b \Pi_b\|_F^2 - \Big\| \sum_b w_b M_b \Pi_b \Big\|_F^2$$

over $\Pi_1, \ldots, \Pi_B$, which can be expressed through a suitable $B$-dimensional cost
array $c$. This is an instance of the Multi-dimensional Assignment Problem (MAP), which
is known to be NP-hard.

For hard partitions $M$ and fixed $\Pi_1, \ldots, \Pi_B$,
$\sum_b w_b \|M - M_b \Pi_b\|_F^2 = \|M\|_F^2 - 2 \sum_b w_b \, \mathrm{trace}(M' M_b \Pi_b)
+ \sum_b w_b \|M_b \Pi_b\|_F^2 = \mathrm{const} - 2 \, \mathrm{trace}(M' \bar{M})$. As
$\mathrm{trace}(M' \bar{M}) = \sum_{i,k} m_{ik} \bar{m}_{ik}$, if again
$v \ge \max(v_1, \ldots, v_B)$, this can be maximized by choosing, for each row $i$,
$m_{ik} = 1$ for the first $k$ such that $\bar{m}_{ik}$ is maximal for the
$i$-th row of $\bar{M}$. I.e., the optimal $M$ is given by a closest hard partition
$H(\bar{M})$ of $\bar{M}$ (“winner-takes-all weighted voting”).

Inserting the optimal $M$ yields that the optimal permutations are found by solving

$$\max_{\Pi_1, \ldots, \Pi_B} \sum_i \max_k \sum_b w_b \, (M_b \Pi_b)_{ik}$$

which looks “similar” to, if not worse than, the MAP for the soft case.
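The winner-takes-all step that turns a weighted mean membership matrix into a closest hard partition is easily sketched in Python; the function name is an assumption, and ties are resolved in favour of the first maximal column as described above.

```python
def closest_hard_partition(m_bar):
    """Binary membership matrix with a single 1 per row at the first row maximum."""
    hard = []
    for row in m_bar:
        winner = row.index(max(row))  # index() returns the first maximal column
        hard.append([1 if k == winner else 0 for k in range(len(row))])
    return hard


hard = closest_hard_partition([[0.618, 0.374, 0.008], [0.000, 0.435, 0.435]])
```

In the second row the tie between the last two columns is resolved in favour of the earlier column.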

In both cases, we find that determining Euclidean consensus partitions by
simultaneous optimization over the memberships $M$ and permutations $\Pi_1, \ldots, \Pi_B$
leads to very hard combinatorial optimization problems, for which solutions by
exhaustive search are only possible for very “small” instances. Hornik and Böhm (2007)
introduce an “Alternating Optimization” (AO) algorithm based on the natural idea to
alternate between minimizing the criterion function $\sum_b w_b \|M - M_b \Pi_b\|_F^2$ over
the permutations for fixed $M$, and over $M$ for fixed permutations. The first amounts to
solving $B$ (independent) linear sum assignment problems, the latter to computing
suitable approximations to the weighted mean $\bar{M} = \sum_b w_b M_b \Pi_b$ (see above for
the case where $v \ge \max(v_1, \ldots, v_B)$; otherwise, one needs to “project” or constrain
to the space of all $M$ with only $v$ leading non-zero columns). If every update reduces
the criterion function, convergence to a fixed point is ensured (it is currently unknown
whether these are necessarily local minima of the criterion function). These AO
algorithms, which are implemented as methods "SE" (default) and "HE" of function
cl_consensus of package clue (Hornik, 2007a), provide efficient heuristics for
finding the global optimum, provided that the best solution found in “sufficiently many”
replications with random starting values is employed.



Table 1. Memberships for the soft Euclidean consensus partition with v = 3 classes for the
Gordon-Vichi macroeconomic ensemble.

Argentina    0.618  0.374  0.008      Norway        0.082  0.912  0.006
Bolivia      0.666  0.056  0.278      Portugal      0.488  0.452  0.060
Canada       0.018  0.980  0.002      South Africa  0.626  0.366  0.008
Chile        0.632  0.356  0.012      Spain         0.314  0.658  0.028
Egypt        0.750  0.070  0.180      Sudan         0.566  0.088  0.346
France       0.012  0.988  0.000      Sweden        0.050  0.944  0.006
Greece       0.736  0.194  0.070      U.K.          0.112  0.872  0.016
India        0.542  0.076  0.382      U.S.A.        0.062  0.930  0.008
Indonesia    0.616  0.144  0.240      Uruguay       0.680  0.310  0.010
Italy        0.044  0.950  0.006      Venezuela     0.600  0.390  0.010
Japan        0.134  0.846  0.020
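The alternating optimization scheme of Section 2.2 for soft consensus partitions can be sketched in Python under simplifying assumptions: all membership matrices are already padded to the same number of columns v (so that the weighted mean is feasible), the matching step enumerates all column permutations instead of solving a linear sum assignment problem, and a single given starting value replaces the random replications. The names are illustrative and do not mirror the interface of the clue package.

```python
from itertools import permutations


def permute_columns(m, perm):
    """Column k of the result is column perm[k] of m."""
    return [[row[perm[k]] for k in range(len(perm))] for row in m]


def sq_dist(a, b):
    """Squared Frobenius distance between two equally sized matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def best_match(m, m_b):
    """Column-permuted copy of m_b that is closest to m (brute force)."""
    v = len(m[0])
    candidates = (permute_columns(m_b, p) for p in permutations(range(v)))
    return min(candidates, key=lambda c: sq_dist(m, c))


def weighted_mean(matrices, weights):
    n, v = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[i][k] for m, w in zip(matrices, weights)) for k in range(v)]
        for i in range(n)
    ]


def soft_consensus(ensemble, weights, start, max_iter=100, tol=1e-12):
    """Alternate between rematching classes and averaging memberships."""
    m = [list(row) for row in start]
    value = previous = float("inf")
    for _ in range(max_iter):
        matched = [best_match(m, m_b) for m_b in ensemble]   # fixed M
        m = weighted_mean(matched, weights)                  # fixed matchings
        value = sum(w * sq_dist(m, c) for w, c in zip(weights, matched))
        if previous - value <= tol:
            break
        previous = value
    return m, value


m1 = [[1, 0], [1, 0], [0, 1]]
m2 = [[0, 1], [0, 1], [1, 0]]  # same partition with relabeled classes
consensus, criterion = soft_consensus([m1, m2], [0.5, 0.5], start=m1)
```

Since each of the two steps cannot increase the criterion, the loop stops as soon as no further reduction is observed; in practice several starting values would be tried and the best solution kept.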



3 Applications 

3.1 Gordon-Vichi macroeconomic ensemble

Gordon and Vichi (2001, Table 1) provide soft partitions of 21 countries based on 
macroeconomic data for the years 1975, 1980, 1985, 1990, and 1995. These parti- 
tions were obtained using fuzzy c-means on measurements of variables such as an- 
nual per capita gross domestic product (GDP) and the percentage of GDP provided 
by agriculture. The 1980 and 1990 partitions have 3 classes, the remaining ones two. 

Table 1 shows the memberships of the soft Euclidean consensus partition for
$v = 3$ based on 1000 replications of the AO algorithm. It can be verified by
exhaustive search (which is feasible as there are at most $6^5 = 7776$ possible permutation
sequences) that this is indeed the optimal solution. Interestingly, one can see that
the maximal membership values are never attained in the third column, such that
the corresponding closest hard partition (which is also the hard Euclidean consensus
partition) has only 2 classes. One might hypothesize that there is a bias towards
2-class partitions as these form the majority (3 out of 5) of the data set, and that
3-class consensus partitions could be obtained by suitably “up-sampling” the 3-class
partitions, i.e., increasing their weights $w_b$. Table 2 indicates how a third consensus
class is formed when giving the 3-class partitions $w$ times the weight of the 2-class
ones (all these countries are in class 1 for the unweighted consensus partition): the
order in which countries join this third class (of the least developed countries) agrees
very well with the “sureness” of their classification in the unweighted consensus, as
measured by their margins, i.e., the difference between the largest and second largest
membership values for the respective objects.
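The margins referred to above can be recomputed from the memberships in Table 1 with a short Python sketch. The selection of those countries whose second largest membership lies in the third class is an assumption about how the comparison with Table 2 is best made; the function names are illustrative.

```python
# Memberships of the soft Euclidean consensus partition (Table 1).
memberships = {
    "Argentina": (0.618, 0.374, 0.008), "Norway": (0.082, 0.912, 0.006),
    "Bolivia": (0.666, 0.056, 0.278), "Portugal": (0.488, 0.452, 0.060),
    "Canada": (0.018, 0.980, 0.002), "South Africa": (0.626, 0.366, 0.008),
    "Chile": (0.632, 0.356, 0.012), "Spain": (0.314, 0.658, 0.028),
    "Egypt": (0.750, 0.070, 0.180), "Sudan": (0.566, 0.088, 0.346),
    "France": (0.012, 0.988, 0.000), "Sweden": (0.050, 0.944, 0.006),
    "Greece": (0.736, 0.194, 0.070), "U.K.": (0.112, 0.872, 0.016),
    "India": (0.542, 0.076, 0.382), "U.S.A.": (0.062, 0.930, 0.008),
    "Indonesia": (0.616, 0.144, 0.240), "Uruguay": (0.680, 0.310, 0.010),
    "Italy": (0.044, 0.950, 0.006), "Venezuela": (0.600, 0.390, 0.010),
    "Japan": (0.134, 0.846, 0.020),
}


def margin(row):
    """Difference between the largest and second largest membership value."""
    top, second = sorted(row, reverse=True)[:2]
    return top - second


def runner_up(row):
    """Index of the class with the second largest membership value."""
    return sorted(range(len(row)), key=lambda k: row[k], reverse=True)[1]


# Classes actually used by the closest hard partition (0-based indices).
hard_classes = {row.index(max(row)) for row in memberships.values()}

# Countries whose runner-up class is the third one, ordered by margin.
third_class_candidates = sorted(
    (name for name, row in memberships.items() if runner_up(row) == 2),
    key=lambda name: margin(memberships[name]),
)
```

With the values of Table 1 this yields the order India, Sudan, Indonesia, Bolivia, Egypt, with margins of 0.160, 0.220, 0.376, 0.388 and 0.570, respectively, and confirms that only the first two classes occur in the closest hard partition.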

Table 2. Formation of a third class in the Euclidean consensus partitions for the Gordon-Vichi
macroeconomic ensemble as a function of the weight ratio w between 3- and 2-class partitions
in the ensemble.

   w     countries in the third class
  1.5    India
  2.0    India, Sudan
  3.0    India, Sudan
  4.5    India, Sudan, Bolivia, Indonesia
 10.0    India, Sudan, Bolivia, Indonesia
 12.5    India, Sudan, Bolivia, Indonesia, Egypt
  ∞      India, Sudan, Bolivia, Indonesia, Egypt



3.2 Rosenberg-Kim kinship terms data

Rosenberg and Kim (1975) describe an experiment where perceived similarities of
the kinship terms were obtained from six different “sorting” experiments. In one of
these, 85 female undergraduates at Rutgers University were asked to sort 15 English
terms into classes “on the basis of some aspect of meaning”. There are at least three
“axes” for classification: gender, generation, and direct versus indirect lineage. The
Euclidean consensus partitions with $v = 3$ classes put grandparents and grandchildren
in one class and all indirect kins into another one. For $v = 4$, {brother, sister}
are separated from {father, mother, daughter, son}. Table 3 shows the memberships
for a soft Euclidean consensus partition for $v = 5$ based on 1000 replications of the
AO algorithm.



Table 3. Memberships for the 5-class soft Euclidean consensus partition for the
Rosenberg-Kim kinship terms data.

grandfather     0.000  0.024  0.012  0.965  0.000
grandmother     0.005  0.134  0.016  0.840  0.005
granddaughter   0.113  0.242  0.054  0.466  0.125
grandson        0.134  0.111  0.052  0.581  0.122
brother         0.612  0.282  0.024  0.082  0.000
sister          0.579  0.391  0.026  0.002  0.002
father          0.099  0.546  0.122  0.158  0.075
mother          0.089  0.654  0.136  0.054  0.066
daughter        0.000  1.000  0.000  0.000  0.000
son             0.031  0.842  0.007  0.113  0.007
nephew          0.012  0.047  0.424  0.071  0.447
niece           0.000  0.129  0.435  0.000  0.435
cousin          0.080  0.056  0.656  0.033  0.174
aunt            0.000  0.071  0.929  0.000  0.000
uncle           0.000  0.000  0.882  0.071  0.047



Figure 1 indicates the classes and margins for the 5-class solutions. We see that
the memberships of ‘niece’ are tied between columns 3 and 5, and that the margin
of ‘nephew’ is only very small (0.02), suggesting the 4-class solution as the optimal
Euclidean consensus representation of the ensemble.



Fig. 1. Classes (indicated by plot symbol and class id) and margins (differences between the
largest and second largest membership values) for the 5-class soft Euclidean consensus
partition for the Rosenberg-Kim kinship terms data.



Quite interestingly, none of these consensus partitions split according to gender,
even though there are such partitions in the data. To take the natural heterogeneity
in the data into account, one could try to partition them (perform clusterwise
aggregation, Gaul and Schader (1988)), resulting in meta-partitions (Gordon and
Vichi (1998)) of the underlying objects. Function cl_pclust in package clue
provides an AO heuristic for soft prototype-based partitioning of classifications,
allowing in particular to obtain soft or hard meta-partitions with soft or hard Euclidean
consensus partitions as prototypes.



Abstract. In developing information systems, conceptual models are used for varied purposes.
Since the modeling process is characterized by interpreting and abstracting the situation at
hand, it is essential to include information about the design process the modelers went through.
This aspect is often disregarded. But the lack of this information hinders the reuse of past
knowledge for similar problems encountered later and promotes the repetition of failures.

The design rationale approaches, discussed in the software engineering community since
the 1990s, seem to be an effective means to solve these problems. But the semiformal style of
the rationale models makes the retrieval of the relevant information challenging. The paper
explores an approach for classifying issues by their responding alternatives as an access to the
complex rationale documentation.



1 Subjectivism in the modeling process 

Our considerations are based on a moderate constructivist position. This attitude of
mind has significant consequences for the design of the modeling process as well as
for the evaluation of the quality of the resulting model. As outlined in Schütte
and Rotthowe (1998), a model is the result of a cognitive process carried out by a modeler,
who structures the considered system according to a specific purpose. Because
of the differing thought patterns of the stakeholders, a consensus about structuring the
problem domain as well as about the model representation has to be reached. In this
way the modeling process is a consensus-oriented one.

The definition of the application domain terms is an accepted starting point for 
the process of conceptual modeling (cp. Holten (2003), p. 201). Therefore it is fair 
to assume that no misinterpretation of the applied terminology occurs. 

In order to manage the subjectivity in the modeling process and to support the
traceability of the conceptualizations done by the model designer, Schütte and
Rotthowe proposed the Guidelines of Modeling as generic modeling conventions
(cp. Schütte and Rotthowe (1998)). In doing so they considered not only the
significant role of the model designer but also the role of the model user. They claim
that the model user is only able to interpret the model correctly if he knows
the underlying guidelines of the model design (cp. Schütte and Rotthowe (1998), p.
242).

Model designers are facing similar problems in different projects (cp. Fowler
(1997)). Owing to the lack of an explicit and maintained knowledge base containing
experiences in model construction and model use, similar problems are solved repeatedly
at higher costs than necessary (cp. Hordijk and Wieringa (2006), p.
353).

Due to the subjectivism in the modeling process it is essential to externalize the
assumptions and objectives the model is based on. The traceability of the model construction
is not only relevant for reusing modeling solutions but also for maintaining
the model itself. Stakeholders who were not involved in the modeling process are
not able to interpret the model in the right way. Particularly with regard to partial
changes of the model, the lack of rationale information could have far-reaching consequences
like violating assumptions, constraints or tradeoffs.

Argumentation based models of design rationale ought to be suitable for solv- 
ing these problems (cp. Dutoit et al. (2006)). Based on the literature about Design 
Rationale approaches in Software Engineering we derive an approach for reusing 
experiences in conceptual modeling. For this purpose we use the classification of 
rationale fragments accessing different rationale models resulting from various mod- 
eling projects. 



2 The design rationale approach 

According to the current state of knowledge in software engineering, issue models,
which represent the justification for a design in a semiformal manner, are the most
promising approach to solve the problems described above (cp. Dutoit et al. (2006)).
They could be used for structuring the rationale in a more systematic way than text
documentation does. In addition, implementing a knowledge base containing the rationales
of past modeling projects could improve the efficiency of future modeling
processes as well as the quality of the resulting artifacts.

Van der Ven et al. identified a general process for creating rationale, which
most of the approaches have in common (cp. van der Ven et al. (2006), p. 333).

After the problems are identified and described in problem statements they are
evaluated one by one. Alternative solutions are created, evaluated and weighted for
their suitability for solving the problem at hand. After an informed decision is made,
it is documented along with its justification in a rationale document.

Various approaches for capturing design rationale have evolved. Most of
them are based on very similar concepts and are more or less restrictive. For our
purposes we have chosen the QOC notation, because it is quite expressive and deals
directly with the evaluation of artifact features (cp. Dutoit et al. (2006), p. 13).

2.1 The QOC-Notation 

The Questions, Options, and Criteria (QOC) notation is used for the design space
analysis, which “[...] creates an explicit representation of a structured space of design
alternatives and the considerations for choosing among them [...]” (MacLean et al.
(1991), p. 203).

QOC is a semiformal node-and-link diagram. Though it provides a formal structure,
the statements within any of the nodes are informal and unrestricted. MacLean
et al. define the three basic concepts: questions, options, and criteria. These concepts
and their relations are depicted in Figure 1.



[Figure 1: node-and-link diagram connecting Question, Option, Criterion and Argument nodes]

Fig. 1. QOC notation



Questions represent key issues of design decisions that have no trivial solution.
They are a means for structuring the design space of an artifact. Options are alternative
solutions responding to a question. “[...] Criteria represent the desirable properties
of the artifact and requirements that it must satisfy [...]” (MacLean et al. (1991), p.
208). Because they state the objectives of the design in a clear and structured manner,
they form the basis of evaluation, weighting and selection of a design solution. The
labeled link between an option and a criterion displays the assessment of whether the
option satisfies the criterion. In doing so tradeoffs are made explicit and the discussion
about choosing among the options focuses on the purpose the design is made for.

The presented design space analysis is an argumentation based approach. For this
reason all of the QOC elements can be supported or challenged by arguments.
These arguments could play an important role in the evolution of the organizational
knowledge base. When a design solution is reused, the validity of the arguments
on which the original design decision was based has to be verified.
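The structure described above can be sketched as a small data model. This is only an illustration, not part of the QOC notation itself: the class names, the reduction of the labeled links to a boolean, the naive scoring rule, and the example question, options and criteria are all assumptions made here, and arguments are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Option:
    text: str
    # Hypothetical encoding of the labeled links:
    # criterion text -> True (option supports it) / False (option objects to it).
    assessments: dict = field(default_factory=dict)


@dataclass
class Question:
    text: str
    options: list = field(default_factory=list)


def score(option):
    """Supporting minus objecting assessments (a deliberately naive ranking)."""
    return sum(1 if ok else -1 for ok in option.assessments.values())


def best_option(question):
    """Option with the highest naive score; ties go to the first one listed."""
    return max(question.options, key=score)


# Invented example content, for illustration only.
q = Question("How should variant X be represented?")
q.options = [
    Option("Option A", {"extensibility": True, "model size": False}),
    Option("Option B", {"extensibility": False, "model size": True,
                        "simplicity": True}),
]
print(best_option(q).text)
```

Keeping the assessments on the option makes the tradeoffs between options visible in one place, which is the point the notation makes with its labeled links.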

One objection to the utility of rationale models is that they are very complex
and hard to manage without any tool support (cp. MacLean et al. (1991), p. 216).
Due to the complexity of the rationale models it is necessary to provide an effective
retrieval mechanism. Otherwise this kind of documentation seems to be useless for a
managed organizational memory.

2.2 Reuse of rationale documentation

Since the capturing of design rationale takes considerable effort, the benefit from 
using the resulting models has to exceed the costs of their construction. 

Hordijk and Wieringa propose Reusable Rationale Blocks for reusing design 
knowledge in order to improve quality and efficiency of design choices (cp. Hordijk 
and Wieringa (2006)). For achieving this goal they use generalized pieces of decision 
rationale. 

The idea of Reusable Rationale Blocks is based on the QOC approach and on
the concept of design patterns. Design patterns are widely accepted approaches for
reusing design knowledge. Though they provide a detailed description of a solution
for a recurring design problem, they lack evaluations of alternative solutions
(cp. Hordijk and Wieringa (2006), p. 356). But they are appropriate options within
a QOC model, which could be ranked by a set of quality indicators. In this way
tradeoffs and dependencies among solutions can be considered.

In order to define appropriate patterns and to assemble an experience base, the
documented argumentation, i.e. the rationale models, has to be analyzed. To support
the analysis of the rationale documentation of several modeling projects an effective
and efficient access is needed. This goal requires that all information relevant to the
problem at hand is retrieved and that no irrelevant information is part of the answer
set. Precision and recall are accepted measures for assessing the achievement of this
objective.
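As a brief illustration of the two measures, the following sketch computes precision and recall for one query against a rationale base. The fragment identifiers and the function name are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / |retrieved|, recall = hits / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical answer set and relevance judgments for one query.
retrieved = {"q1", "q2", "q3", "q4"}
relevant = {"q1", "q2", "q5"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved fragments are relevant and two of the three relevant fragments are found, so neither requirement stated above is fully met.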

The classification scheme presented in the next section could be regarded as an 
intermediate stage for editing the rationale information of project specific documen- 
tations to generate generic rationale information like the described Reusable Ratio- 
nale Blocks. 



3 Classification of rationale fragments 

The QOC notation is more restrictive than most of the other approaches and deals
directly with the evaluation of artifact features. These are premises for classifying
the options of diverse rationale models as a systematic entry to the rationale documentation.

To illustrate our idea we use Fowler's analysis patterns (cp. Fowler (1997)). He
discusses different alternatives for modeling derivatives.

[Figure 2: two class diagrams of Contract, (a) Subtyping and (b) Boolean Attribute]

Fig. 2. Alternative Modeling of Long and Short



Figure 2 shows two different models of a contract and the distinction between 
Long and Short. In the first model subtyping is used for this purpose whereas the 
second one uses the Boolean attribute isLong. Fowler states that both alternatives 
are equivalent in conceptual modeling (cp. Fowler (1997), p. 177). 




(a) Option as a Subtype of a Contract (b) Option as separate Object 



Fig. 3. Different Structures of the Optionality of a Contract 



For modeling the concept Option Fowler presents two alternatives depicted in
Figure 3 (cp. Fowler (1997), pp. 200ff.). In the first model the optionality of a contract
is represented by subtyping. In this way an option is a “[...] kind of contract with
additional properties and some variant behavior [...]” (Fowler (1997), p. 204). The
second model differentiates between an option and its underlying base contract. Even
Fowler can give only little advice for choosing among these alternative modeling
solutions.

Fig. 4. Example for a Design Space Analysis



For this purpose we analyzed the rationale for the modeling alternatives presented
by Fowler. Figure 4 shows an extract of the rationale model using QOC.
The represented discussion is based on the assumption that there has been a decision
to include the information objects Option, Long and Short in the model. From these
decisions, there follow two Questions concerning the diverse alternatives.

On closer examination two different kinds of modeling issues can be derived
from the provided solutions. The first kind comprises problem solutions concerning the use
of the modeling grammar and its influence on the resulting model quality. For solving
these problems the knowledge, experiences and assumptions of the modeling expert
are decisive.

As a second kind of issues we can identify questions concerning the structuring 
of the considered system. The expertise and the instinct of the domain expert should 
dominate this discussion. 

A rationale fragment contains at least a question and its associated options, cri- 
teria, and arguments. One single question deals either with structuring the problem 
domain or with applying the modeling grammar. While the considered options in 
the QOC model can be identified by means of the formal structure, the statements 
within the nodes are facing the common problems of information retrieval. If we can 
presume a defined terminology both of the application domain and of the modeling 
grammar a classification of the Options can identify Questions concerning similar 
design problems discussed in several rationale models. The resulting classification 
can be used as a starting point for the analysis of the archived rationale documenta- 
tion in order to accumulate and aggregate the specific project experiences. 
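The retrieval idea can be sketched as an index from option classes to the questions in which options of that class are discussed. The sketch assumes a defined terminology, here reduced to a lookup table. The class assignments, project names and question texts are hypothetical and only loosely follow the Fowler examples above.

```python
from collections import defaultdict

# Hypothetical classification of option labels: label -> (branch, class).
OPTION_CLASSES = {
    "subtyping": ("modeling grammar", "inheritance"),
    "boolean attribute": ("modeling grammar", "attribute"),
    "option as subtype of contract": ("problem domain", "derivatives"),
    "option as separate object": ("problem domain", "derivatives"),
}


def build_index(fragments, option_classes):
    """Map each option class to the (project, question) pairs that discuss it."""
    index = defaultdict(set)
    for project, question, options in fragments:
        for option in options:
            cls = option_classes.get(option.lower())
            if cls is not None:
                index[cls].add((project, question))
    return index


# Invented rationale fragments: (project, question, options considered).
fragments = [
    ("project A", "How to distinguish Long and Short?",
     ["Subtyping", "Boolean attribute"]),
    ("project B", "How to distinguish buy and sell orders?", ["Subtyping"]),
    ("project A", "How to model Option?",
     ["Option as subtype of contract", "Option as separate object"]),
]

index = build_index(fragments, OPTION_CLASSES)
for project, question in sorted(index[("modeling grammar", "inheritance")]):
    print(project, "|", question)
```

Looking up one class returns questions from different projects that considered the same kind of alternative, which is the intended starting point for aggregating project experiences.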

To exemplify our thoughts Figure 5 depicts a possible classification of rationale 
fragments. The two main branches, problem domain and modeling grammar, catego- 
rize the rationale information according to the experiences of the domain expert and 
the modeling expert respectively. 

The differentiation between these two kinds of modeling issues is also reflected
in the two principles of the Guidelines of Modeling, construction adequacy and language
suitability (cp. Schütte and Rotthowe (1998), p. 246). These very principles
reveal that information modeling is characterized by various decision problems. Thus
the choice of the information objects relevant for the modeling problem determines
the appropriateness of the resulting model. Furthermore, an agreement about the
application of certain modeling techniques has to be reached.

[Figure 5: classification tree with root Conceptual Modelling, branches Problem Domain and
Modelling Grammar, and further nodes Derivatives, ..., Inheritance, Single, Multiple]

Fig. 5. Classification of Rationale Fragments

The branch referring to the usability and utility of the modeling grammar de- 
serves closer attention. Rationale documentations concerning these kinds of issues 
are not only useful for the model designer and user, but they are also invaluable as 
feedback information for an incremental knowledge base for the designers of the 
modeling method. 

Experiences in method use, i.e. usage of the modeling grammar, are recognized
as an essential resource for the method engineering process (cp. Rossi et al.
(2004)). Rossi et al. stress this kind of information as a complementary part of
the method rationale documentation. They define the method construction rationale
and the method use rationale as a coherent unit of rationale information.



4 Conclusion 

The paper suggests that a classification of design rationale fragments can support the
analysis and reuse of modeling experiences, resulting in an explicit and systematically
structured organizational memory.

Owing to the subjectivism in the modeling process the application of an argumen- 
tation based design rationale approach could assist the reasoning in design decisions 
and the reflection of the resulting model. Furthermore Reusable Rationale Blocks are 
valuable assets for estimating the quality of the prospective conceptual model. 

The semiformality of the complex rationale models makes it difficult to retrieve
documented discussions relevant to a specific modeling problem. The paper presents
an approach for classifying issues by their responding alternatives as a systematic entry
into the rationale models and as a starting point for the analysis of modeling experiences.

What is needed now is empirical research on the impact of design rationale modeling
on the resulting conceptual model. An appropriate notation has to be elaborated.
This is not a trivial task because of the tradeoff between a flexible modeling
grammar and an effective retrieval mechanism. The more formal a notation is, the
more precisely the retrieval system works. The other side of the coin is that the more
formal a notation is, the more the capturing of rationale information interferes with the design work.
But a highly intrusive approach will hardly be used for supporting decision making on
the fly.

Abstract. A clustering algorithm is, in essence, characterized by two features: (1) the way in
which the heterogeneity within resp. between clusters is measured (objective function) and (2) the
steps in which the splitting resp. fusion proceeds. For categorical data there are no “standard
indices” formalizing the first aspect. Instead, a number of ad hoc concepts have been
used in cluster analysis, labelled “similarity”, “information”, “impurity” and the like. To clarify
matters, we start out from a set of axioms summarizing our conception of “dispersion” for
categorical attributes. Not surprisingly, it turns out that some well-known measures, including
the Gini index and the entropy, qualify as measures of dispersion. We try to indicate how
these measures can be used in unsupervised classification problems as well. Due to its simple
analytic form, the Gini index allows for a dispersion-decomposition formula that can be made
the starting point for a CART-like cluster tree. Trees are favoured because of i) factor selection
and ii) communicability.



1 Motivation 

Most data sets in business administration show attributes of mixed type i.e. numerical 
and categorical ones. The classical text-book advice to cluster data of this kind can 
be summarized as follows 

a) Measure (dis-)similarities among attribute vectors separately on the basis of 
either kind of attributes and unite both the resulting numbers in a (possibly 
weighted) sum. 

b) In order to deal with the categorical attributes, encode them in a suitable (binary) 
way and look for coincidences all over the resulting vectors. Condense your 
findings with the help of one of the numerous existing matching coefficients. 

(cf. Fahrmeir et al. (1996), p. 453). This advice, however, is bad policy for at least
two reasons. Treating both parts of the attribute vectors separately amounts to saying
that both groups of variables are independent, which can be claimed only in exceptional
cases. By looking for bit-wise coincidences, as in step b), one completely
loses contact with the individual attributes. This feature, too, is statistically undesirable.
For that reason it seems to be less harmful to categorize numerical quantities
and to deal with all variables simultaneously, but to avoid matching coefficients
and the like. During the last decade roughly a dozen agglomerative or partitioning
cluster algorithms for categorical data have been proposed, quite a few based on
the concept of entropy. Examples include “COOLCAT” (Barbara et al. (2002)) or
“LIMBO” (Andritsos et al. (2004)). These approaches, no doubt, have their merits.
For various reasons, however, it would be advantageous to rely on a divisive,
tree-structured technique that

a) supports the choice of relevant factors, 

b) helps to identify the resulting clusters and renders the device communicable to 
practitioners. 

In other words, we favour some unsupervised analogue to CART or CHAID. 

That type of procedure, furthermore, facilitates the use of prior information on
the attribute level, as will be seen in Section 3. Within that context comparisons of
attributes should no longer be based on similarity measures but on quantities
that allow for a model equivalent and, accordingly, can be related to the underlying
probability source. For that purpose we shall work out the concept of “dispersion” in
Section 2 and discuss starting points for cluster algorithms in Section 3. The material
in Section 2 may bewilder some readers as it seems that “somebody should have
written down something like that a long time ago”. Despite some efforts, however, no
such source could be found in the literature.

There is another important aspect that has to be addressed. Categorical data is 
typically organized in form of tables or cubes. Obviously, the number of cells ex- 
ponentially increases with the number of factors taken into consideration. This, in 
turn, will result in many empty or sparsely populated cells and render the analysis 
obsolete. In order to circumvent this difficulty, some form of “sequential sub-cube 
clustering” is needed (and will be reported elsewhere). 



2 Measures of dispersion 

What is a meaningful splitting criterion? There are essentially three answers provided
in the literature: “impurity”, “information” and “distance”. The axiomatization
of impurity is somewhat scanty. Every symmetric functional of a probability vector
qualifies as a measure of impurity iff it is minimal (zero) in the deterministic case
and takes its maximum value at the uniform distribution (cf. Breiman et al. (1984),
p. 24). That concept is not very specific and it hardly gives way to an interpretation
in terms of “intra-class-density” or “inter-class-sparsity”. Information, on the other
hand, can be made precise by means of axioms that uniquely characterize the Shannon
entropy (cf. Renyi (1971), p. 442). The reading of those axioms in the realm
of classification and clustering is disputable. Another approach to splitting is based
on probability metrics measuring the dissimilarity of stochastic vectors representing
different classes. Various types of divergences figure prominently in that context (cf.
Teboulle et al. (2006), for instance). That approach, no doubt, is conceptually sound
but suffers from a technical drawback in the present context. Divergences are defined
in terms of likelihood ratios and, accordingly, are hardly able to distinguish among
(exactly or approximately) orthogonal probabilities. Orthogonality among cluster
representatives, however, is considered to be a desirable feature.

A time-tested road to clustering for objects represented by quantitative attribute
vectors is based on functions of the covariance matrix (e.g. determinant or trace). It is
natural to mimic those approaches in the presence of qualitative characteristics.
However, there seems to be no source in the literature that systematically specifies
a notion like “variability”, “volatility”, “diversity”, “dispersion” etc. for categorical
variates. In order to make this conception precise, we consider functionals $D$,

$$D : \mathcal{P} \to [0,\infty[$$

where $\mathcal{P}$ denotes the class of all finite stochastic vectors, i.e. $\mathcal{P}$ is the union of the sets
$\mathcal{P}_K$ comprising all probability vectors of length $K \ge 2$. $D$, of course, will be subject
to further requirements:

(PI) “invariance w.r. to permutations (relabellings)”

$$D(p_{\sigma_1},\dots,p_{\sigma_K}) = D(p_1,\dots,p_K)$$
for all $p = (p_1,\dots,p_K) \in \mathcal{P}_K$ and all permutations $\sigma$.

(MD) “dispersion is minimal in the deterministic case”

$D(p) = 0$ iff $p$ is a unit vector.

(MA) “D is monotone w.r. to majorization”

$$p <_m q \;\Rightarrow\; D(p) \ge D(q), \qquad p, q \in \mathcal{P}_K.$$

In particular, $D$ takes its maximum at the uniform distribution (cf. Tong (1980),
p. 102ff for the definition of $<_m$ and some basics).

(SC) “splitting cells increases dispersion”

$$D(p_1,\dots,p_{k-1},r,s,p_{k+1},\dots,p_K) \ge D(p_1,\dots,p_{k-1},p_k,p_{k+1},\dots,p_K)$$
where $p \in \mathcal{P}_K$, $0 \le r,s$ and $r + s = p_k$.

(MP) “mixing probabilities increases dispersion”

$$D((1-r)p + rq) \ge (1-r)D(p) + rD(q)$$

for $0 < r < 1$ and $p,q \in \mathcal{P}_K$. In addition to concavity we assume $D$ to be
continuous on all of $\mathcal{P}_K$, $K \ge 2$.

(EC) “consistency w.r. to empty cells”

$$D(p_1,\dots,p_K,0) \le D(p_1,\dots,p_K)$$

for all $p \in \mathcal{P}_K$.

Definition 1. A functional $D$ satisfying (PI), (MD), (MA), (SC), (MP), (EC) is called
a (categorical) measure of dispersion.

Some comments on this definition. 

1. The majorization ordering seems to be a “natural” choice and it guarantees that
$D$ is also a measure of impurity. “$<_m$” could be replaced, however, by an
ordering expressing concentration around the mode. The restriction to unimodal
probabilities (frequencies) and the dependency on a measure of location to be
specified in advance that this entails is somewhat undesirable.

2. In an earlier draft, (EC) was formulated with “$=$” instead of “$\le$”. Some helpful
remarks made by C. Henning and A. Ultsch led to this modification. It allows
for measures that relate dispersion to the length of the stochastic vector. This
might be meaningful in tree-building in order to prevent a preferential treatment
of attributes exhibiting many levels. Such an index, for instance, could take on
the form

$$p \in \mathrm{int}(\mathcal{P}_K) \;\Rightarrow\; D(p) = w_K \sum_k g(p_k),$$

where $g$ is some “suitable” function (see below) and the $w_K$ are some discounting
weights.

3. In case of ordinal variates, it makes sense to restrict the class of permutations in 
(PI). 

For the sake of convenience (and “w.l.o.g.”) the axioms above were formulated by
means of a linearly ordered indexing set. With two-way (or higher order) tables,
multiple indices $k = (i,j) \in K = I \times J$ are more convenient. The marginal resp. conditional
distributions associated with probabilities (or empirical frequencies) $(p_{ij})$
on an $I \times J$-table are denoted as usual, e.g.

$$p^{(1)}_i = p_{i\cdot} \quad\text{or}\quad p^{(2|1)}(j \mid i) = p_{ij}/p_{i\cdot}\,.$$

The next assertion parallels the well-known formula “$\sigma^2(Y) \ge E(\sigma^2(Y \mid X))$”.

Proposition 1. Let $D$ be a measure of dispersion and $(p_{ij})$ probabilities on a two-way
table. Then,

$$D(p^{(2)}) \ge \sum_i D\bigl(p^{(2|1)}(\cdot \mid i)\bigr)\, p^{(1)}_i\,.$$

Proof. Consequence of (MP). □

The proposition implies that any measure of dispersion induces a predictive measure
of association $\Lambda_D$,

$$\Lambda_D = \frac{D(p^{(2)}) - \sum_i D\bigl(p^{(2|1)}(\cdot \mid i)\bigr)\, p^{(1)}_i}{D(p^{(2)})} = \sum_i \Lambda_D(2 \mid 1 = i)\, p^{(1)}_i\,,$$

where $\Lambda_D(2 \mid 1 = i)$ represents the conditional predictive strength of level $i$. For $D(p) =
1 - p_{\max}$, $\Lambda_D$ is closely related to Goodman-Kruskal's lambda. The measures $\Lambda_D$ can
be employed, for instance, to construct association rules.
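The inequality of Proposition 1 and the induced measure of association can be checked numerically. The sketch below reads the measure as the relative reduction in dispersion of the second factor once the level of the first factor is known. This normalisation is an assumption made for the illustration, as are the function names and the example table.

```python
def one_minus_max(p):
    """D(p) = 1 - p_max, the choice related to Goodman-Kruskal's lambda."""
    return 1.0 - max(p)


def gini(p):
    """Gini index 1 - sum_i p_i^2."""
    return 1.0 - sum(t * t for t in p)


def association(table, D):
    """table[i][j] holds the joint probability p_ij; factor 1 predicts factor 2.

    Returns (marginal dispersion, mean conditional dispersion, relative gain).
    """
    row_totals = [sum(row) for row in table]
    n_cols = len(table[0])
    col_marginal = [sum(row[j] for row in table) for j in range(n_cols)]
    total = D(col_marginal)
    within = sum(pi * D([pij / pi for pij in row])
                 for row, pi in zip(table, row_totals) if pi > 0)
    return total, within, (total - within) / total


# Hypothetical joint distribution on a 2 x 2 table.
table = [[0.30, 0.10],
         [0.05, 0.55]]

for name, D in (("1 - p_max", one_minus_max), ("Gini", gini)):
    total, within, gain = association(table, D)
    print(name, round(total, 4), round(within, 4), round(gain, 4))
```

For $D(p) = 1 - p_{\max}$ the relative gain on this table equals $(0.30 + 0.55 - 0.65)/0.35$, which is the classical Goodman-Kruskal lambda for predicting the column from the row.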



In what follows we shall restrict our attention to functionals $D$ of the form

$$D_g(p) = \sum_i g(p_i)$$

where $g$ is a continuous, concave function on $[0,1]$, $g(0) = g(1) = 0$ and $g(t) > 0$ for
$0 < t < 1$.

Examples.

i.) $g(t) = t(1-t) \Rightarrow D_g(p) = 1 - \sum_i p_i^2 = \sum_{i \ne j} p_i p_j = \mathrm{trace}(\Sigma)$,

$$\Sigma = \mathrm{diag}(p) - p p^T$$

i.e. $D$ is the Gini-index resp. the generalized variance. More general Beta densities
could be employed as well.

ii.) $g(t) = -t \log t \Rightarrow D_g(p) = -\sum_i p_i \log p_i$

i.e. $D$ is the Shannon entropy.
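The two examples can be evaluated directly. The following sketch implements $D_g$ for both choices of $g$ and prints the values for a unit vector, a skewed vector, the uniform distribution and a vector obtained by splitting one cell. The test vectors and function names are illustrative assumptions.

```python
import math


def d_g(p, g):
    """Dispersion D_g(p) = sum_i g(p_i) of a probability vector p."""
    return sum(g(t) for t in p)


def g_gini(t):
    return t * (1.0 - t)


def g_entropy(t):
    # Convention 0 * log 0 = 0, so that g(0) = g(1) = 0.
    return 0.0 if t <= 0.0 else -t * math.log(t)


def gini(p):
    return d_g(p, g_gini)


def entropy(p):
    return d_g(p, g_entropy)


unit = [1.0, 0.0, 0.0]
uniform = [1 / 3, 1 / 3, 1 / 3]
skewed = [0.7, 0.2, 0.1]
split = [0.4, 0.3, 0.2, 0.1]  # cell 0.7 of `skewed` split into 0.4 + 0.3

for name, D in (("Gini", gini), ("entropy", entropy)):
    print(name, D(unit), D(skewed), D(uniform), D(split))
```

Both measures vanish on the unit vector, are largest at the uniform distribution among the three-cell vectors, and grow when a cell is split, in line with (MD), (MA) and (SC).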

Proposition 2. a) $D_g$ is a measure of dispersion.

b) If $g$ is strictly concave, then $D_g$ takes its unique maximum on $\mathcal{P}_K$ at the uniform
distribution $u_K = K^{-1}(1,\dots,1)$.

Proof. (PI), (MD) and (MP) are immediate consequences of the definition. (MA)
follows from a well-known lemma by Schur (cf. Tong (1980), Lemma 6.2.1). In
order to see (SC), just write $r = \alpha p_k$, $s = (1-\alpha) p_k$ and employ concavity. □

Obviously, $D_g(p)$ can efficiently be estimated from a multinomial i.i.d. sample:
with $p_N$ denoting the vector of relative frequencies, $D_g(p_N)$ is the strongly consistent
ML-estimator of $D_g(p)$ and distributional aspects can be settled with the help of the
$\Delta$-method.

Proposition 3. Let $p$ be an interior point of $\mathcal{P}_K$.

a) If $p \ne u_K$, then

$$\mathcal{L}_p\bigl(\sqrt{N}\,(D_g(p_N) - D_g(p))\bigr) \to \mathcal{N}(0, \tau_g^2)$$

where $\tau_g^2 = (\dots, g'(p_k), \dots)\,\Sigma\,(\dots, g'(p_k), \dots)^T$.

b) If $p = u_K$, then

$$\mathcal{L}_p\bigl(N\,(D_g(p_N) - D_g(p))\bigr) \to \mathcal{L}\Bigl(\tfrac{1}{2} \sum_k \lambda_k Y_k^2\Bigr)$$

where $Y_1,\dots,Y_K$ is a sample of standard-normal variates and where $\lambda_1,\dots,\lambda_K$
denote the eigenvalues of $\Sigma^{1/2} H \Sigma^{1/2}$, $H = \mathrm{diag}(\dots, g''(p_k), \dots)$.

Proof.

a) is a direct consequence of Witting and Müller-Funk (1995), Satz 5.107 b), p. 107
(“Delta method”).

b) follows from their Satz 5.127, p. 134. □

The limiting distribution in b) must be worked out for every $g$ separately. For the
Gini index $D_G$ this becomes
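The display belonging to this sentence is not legible here. As a hedged sketch, the limit can be derived directly from the definition of the Gini index, writing $p_{N,k}$ for the relative frequency of cell $k$ (a notation introduced only for this derivation) and using the classical limit of Pearson's statistic. The result follows from these definitions and is not quoted from the original display.

```latex
\begin{align*}
D_G(p_N) - D_G(u_K)
  &= \Bigl(1 - \sum_k p_{N,k}^2\Bigr) - \Bigl(1 - \tfrac{1}{K}\Bigr)
   = -\sum_k \Bigl(p_{N,k} - \tfrac{1}{K}\Bigr)^2
   && \text{since } \sum_k p_{N,k} = 1, \\
N K \sum_k \Bigl(p_{N,k} - \tfrac{1}{K}\Bigr)^2
  &\xrightarrow{\;\mathcal{L}\;} \chi^2_{K-1}
   && \text{(Pearson's statistic under } p = u_K\text{)}, \\
N \bigl(D_G(p_N) - D_G(u_K)\bigr)
  &\xrightarrow{\;\mathcal{L}\;} -\tfrac{1}{K}\,\chi^2_{K-1}.
\end{align*}
```

The negative sign reflects that the estimated dispersion cannot exceed its maximum, which is attained at the uniform distribution.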




3 Segmentation 

Again, we start out from a sample of categorical (multinomial) attribute vectors.
In general, a clustering corresponds to a partition of the objects
$\{1,\dots,N\}$. With categorical data we shall demand that vectors contributing to the
same cell should always be united in the same cluster. With that convention, a clustering
now corresponds to a partitioning of the cells, i.e. is related to the attributes. That
makes it easy to formulate further constraints on the attribute level. For instance, it
can be required in the segmentation process that cells pertaining to some ordinal factor
only come along in intervals within a cluster. As already indicated, we are mainly
interested in building up cluster-trees on the basis of some measure $D_g$.

Now let $p(m)$ be the average of all observations in cluster $C(m)$. According to
our convention, these cluster-representatives become orthogonal. If $C(m)$ is further
decomposed into two subclusters $C_L(m)$ and $C_R(m)$, then

$$D_g(p(m)) - p_L(m)\,D_g(p_L(m)) - p_R(m)\,D_g(p_R(m)) \ge 0$$

is the gain in dispersion within clusters and is to be maximized. A look at the
corresponding formula characterizing a CART-tree (cf. Breiman et al. (1984), p. 25)
reveals that in the absence of the information in labels, a posteriori probabilities are
merely replaced by “centroids”. Matters become even more transparent in case of the
Gini-index $D_G$. This is due to the identity

$$D_G(\alpha p + \beta q) = \alpha^2 D_G(p) + \beta^2 D_G(q) + 2\alpha\beta\,(1 - p^T q)$$
where $p, q \in \mathcal{P}_K$; $\alpha, \beta \ge 0$, $\alpha + \beta = 1$, resulting in the general decomposition formula:

$$D_G(p_N) = \sum_m \pi_N^2(m)\,D_G(p_N(m)) + \sum_{l \ne m} \pi_N(l)\,\pi_N(m)\,\bigl(1 - p_N(l)^T p_N(m)\bigr)$$

$$= D_G(\text{within}) + D_G(\text{between})$$

where $\pi_N(m)$ denotes the proportion of observations in cluster $C(m)$. With our convention
$\pi_N(m) = p_N(m)$ and the decomposition formula becomes

$$D_G(p_N) = \sum_m p_N^2(m)\,D_G(p_N(m)) + \sum_{l \ne m} p_N(l)\,p_N(m)\,.$$

Here, the quantity to be maximized simply becomes $p_L(m) \cdot p_R(m)$. Accordingly,
a cluster is divided into subclasses of approximately the same size if no further prior
information (restriction) is added. This solution to the clustering problem, of course,
is rather blunt and undesirable for most applications. It raises the question, however,
whether related measures (like the entropy) really produce partitions that allow
for a better statistical interpretation. It remains to be seen, moreover, how well the approach
performs if restrictions, prior probabilities or label information are provided.
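The decomposition of the Gini index into a within and a between part can be verified on a small example. The sketch assumes that clusters are sets of cells, that the centroid of a cluster is the conditional distribution of its cells, and that the cluster weight is the total probability of its cells. The cell probabilities are invented.

```python
def gini(p):
    """Gini index 1 - sum_k p_k^2."""
    return 1.0 - sum(t * t for t in p)


def decompose(p, clusters):
    """Split the Gini index of p into within- and between-cluster parts.

    p        : cell probabilities (relative frequencies)
    clusters : partition of the cell indices, given as a list of lists
    """
    within = 0.0
    masses = []
    for cells in clusters:
        mass = sum(p[k] for k in cells)
        # Centroid: conditional cell distribution, zero outside the cluster,
        # so centroids of different clusters are orthogonal.
        centroid = [p[k] / mass if k in cells else 0.0 for k in range(len(p))]
        within += mass ** 2 * gini(centroid)
        masses.append(mass)
    between = sum(a * b for i, a in enumerate(masses)
                  for j, b in enumerate(masses) if i != j)
    return within, between


p = [0.4, 0.1, 0.3, 0.2]
within_even, between_even = decompose(p, [[0, 1], [2, 3]])      # masses 0.5 / 0.5
within_uneven, between_uneven = decompose(p, [[0], [1, 2, 3]])  # masses 0.4 / 0.6

print(gini(p), within_even + between_even, within_uneven + between_uneven)
print(between_even, between_uneven)
```

Both splits reproduce the total Gini index, and the between part, which equals twice the product of the two cluster masses for a binary split, is larger for the balanced split. This matches the remark that clusters are divided into parts of approximately equal size.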

There is a promising alternative route based on the predictive measures of association
introduced earlier. At each node the best predictor-attribute is selected.
Attribute cells with a low conditional predictive power are merged into one and the
node is branched out accordingly. The procedure stops if predictive power falls short of
a prescribed critical value. The whole device, in fact, can be interpreted as some form
of non-linear factor analysis. It will be part of forthcoming work.

Abstract. A central task when integrating data from different sources is to detect identical
items. For example, price comparison websites have to identify offers for identical products.
This task is known, among others, as record linkage, object identification, or duplicate detection.

In this work, we examine problem settings where some relations between items are given
in advance - for example by EAN article codes in an e-commerce scenario or by manually
labeled parts. To represent and solve these problems we bring in ideas of semi-supervised and
constrained clustering in terms of pairwise must-link and cannot-link constraints. We show
that extending object identification by pairwise constraints results in an expressive framework
that subsumes many variants of the integration problem like traditional object identification,
matching, iterative problems or an active learning setting.

For solving these integration tasks, we propose an extension to current object identification 
models that assures consistent solutions to problems with constraints. Our evaluation shows 
that additionally taking the labeled data into account dramatically increases the quality of 
state-of-the-art object identification systems. 



1 Introduction 

When information collected from many sources should be integrated, different objects
may refer to the same underlying entity. Object identification aims at identifying
such equivalent objects. A typical scenario is a price comparison system where offers
from different shops are collected and identical products have to be found. Decisions
about identities are based on noisy attributes like product names or brands. Moreover,
often some parts of the data provide some kind of label that can additionally
be used. For example some offers might be labeled by a European Article Number
(EAN) or an International Standard Book Number (ISBN). In this work we investigate
problem settings where such information is provided on some parts of the data.
We will present three different kinds of knowledge that restrict the set of consistent
solutions. For solving these constrained object identification problems we extend the
generic object identification model by a collective decision model that is guided by
both constraints and similarities.


2 Related work

Object identification (e.g. Neiling 2005) is also known as record linkage (e.g. Win- 
kler 1999) and duplicate detection (e.g. Bilenko and Mooney 2003). State-of-the-art 
methods use an adaptive approach and learn a similarity measure that is used for 
predicting the equivalence relation (e.g. Cohen and Richman 2002). In contrast, our 
approach also takes labels in terms of constraints into account. 

Using pairwise constraints for guiding decisions is studied in the community of 
semi-supervised or constrained clustering - e.g. Basu et al. (2004). However, the 
problem setting in object identification differs from this scenario because in semi- 
supervised clustering typically a small number of classes is considered and often it is 
assumed that the number of classes is known in advance. Moreover, semi-supervised 
clustering does not use expensive pairwise models that are common in object identi- 
fication. 



3 Four problem classes 

In the classical object identification problem $C_{classic}$ a set of objects $X$ should be
grouped into equivalence classes $E_X$. In an adaptive setting, a second set $Y$ of objects
is available where the perfect equivalence relation $E_Y$ is known. It is assumed that $X$
and $Y$ are disjoint and share no classes, i.e. $E_X \cap E_Y = \emptyset$.

In real-world problems there is often no such clear separation between labeled
and unlabeled data. Instead only the objects of some subset $Y$ of $X$ are labeled. We
call this problem setting the iterative problem $C_{iter}$, where $(X, Y, E_Y)$ is given with
$X \supseteq Y$ and $Y^2 \supseteq E_Y$. Obviously, consistent solutions $E_X$ have to satisfy
$E_X \cap Y^2 = E_Y$. Examples of applications for iterative problems are the integration of
offers from different sources where some offers are labeled by a unique identifier like an EAN
or ISBN, and iterative integration tasks where an already integrated set of objects is
extended by new objects.

The third problem setting deals with integrating data from $n$ sources, where each
source is assumed to contain no duplicates at all. This is called the class of matching
problems $C_{match}$. Here the problem is given by $X = \{X_1, \ldots, X_n\}$ with
$X_i \cap X_j = \emptyset$ for $i \neq j$, and the set of consistent equivalence relations
$\mathcal{E}$ is restricted to relations $E$ on $X$ with
$E \cap X_i^2 = \{(x,x) \mid x \in X_i\}$. Traditional record linkage often deals with matching
problems of two data sets ($n = 2$).

At last, there is the class of pairwise constrained problems $C_{constr}$. Here each
problem is defined by $(X, R_{ml}, R_{cl})$ where the set of objects $X$ is constrained by a
must-link relation $R_{ml}$ and a cannot-link relation $R_{cl}$. Consistent solutions are
restricted to equivalence relations $E$ with $E \cap R_{cl} = \emptyset$ and
$E \supseteq R_{ml}$. Obviously, $R_{cl}$ is symmetric and irreflexive whereas $R_{ml}$ has to
be an equivalence relation. In all, pairwise constrained problems differ from iterative
problems by labeling relations instead of labeling objects. The constrained problem class can
better describe local information such as "two offers are the same/different". Such
information can for example be provided by a human expert in an active learning setting.

Fig. 1. Relations between problem classes: $C_{classic} \subset C_{iter} \subset C_{constr}$ and
$C_{classic} \subset C_{match} \subset C_{constr}$.



We will show that the presented problem classes form a hierarchy
$C_{classic} \subset C_{iter} \subset C_{constr}$ and that
$C_{classic} \subset C_{match} \subset C_{constr}$, but neither
$C_{match} \subseteq C_{iter}$ nor $C_{iter} \subseteq C_{match}$ (see Figure 1). First of all,
it is easy to see that $C_{classic} \subseteq C_{iter}$ because any problem
$X \in C_{classic}$ corresponds to an iterative problem without labeled data
($Y = \emptyset$). Also $C_{classic} \subseteq C_{match}$ because an arbitrary problem
$X \in C_{classic}$ can be transformed to a matching problem by considering each object as its
own dataset: $X_1 = \{x_1\}, \ldots, X_n = \{x_n\}$. On the other hand,
$C_{iter} \not\subseteq C_{classic}$ and $C_{match} \not\subseteq C_{classic}$, because
$C_{classic}$ is not able to formulate any restriction on the set of possible solutions
$\mathcal{E}$ as the other classes can do. This shows that:



$$C_{classic} \subset C_{iter}, \qquad C_{classic} \subset C_{match} \qquad (1)$$



Next we will show that $C_{iter} \subset C_{constr}$. First of all, any iterative problem
$(X, Y, E_Y)$ can be transformed to a constrained problem $(X, R_{ml}, R_{cl})$ by setting
$R_{ml} \leftarrow \{(y_1,y_2) \mid y_1 \equiv_{E_Y} y_2\}$ and
$R_{cl} \leftarrow \{(y_1,y_2) \mid y_1 \not\equiv_{E_Y} y_2\}$. On the other hand, there
are problems $(X, R_{ml}, R_{cl}) \in C_{constr}$ that cannot be expressed as an iterative
problem, e.g.:

$$X = \{x_1, x_2, x_3, x_4\}, \quad R_{ml} = \{(x_1,x_2), (x_3,x_4)\}, \quad R_{cl} = \emptyset$$

If one tries to express this as an iterative problem, one would assign to the pair $(x_1,x_2)$
the label $l_1$ and to $(x_3,x_4)$ the label $l_2$. But one has to decide whether or not
$l_1 = l_2$. If $l_1 = l_2$, then the corresponding constrained problem would include the
constraint $(x_2,x_3) \in R_{ml}$, which differs from the original problem. Otherwise, if
$l_1 \neq l_2$, this would imply $(x_2,x_3) \in R_{cl}$, which again is a different problem.
Therefore:



$$C_{iter} \subset C_{constr} \qquad (2)$$



Furthermore, $C_{match} \subseteq C_{constr}$ because any matching problem
$X_1, \ldots, X_n$ can be expressed as a constrained problem with:

$$X = \bigcup_{i=1}^{n} X_i, \quad R_{cl} = \{(x,y) \mid x, y \in X_i \wedge x \neq y\}, \quad R_{ml} = \emptyset$$

There are constrained problems that cannot be translated into a matching problem,
e.g.:

$$X = \{x_1, x_2, x_3\}, \quad R_{ml} = \{(x_1,x_2)\}, \quad R_{cl} = \emptyset$$



Thus: 



$$C_{match} \subset C_{constr} \qquad (3)$$



At last, there are iterative problems that cannot be expressed as matching problems,
e.g.:

$$X = \{x_1, x_2, x_3\}, \quad Y = \{x_1, x_2\}, \quad x_1 \equiv_{E_Y} x_2$$

And there are matching problems that have no corresponding iterative problem, e.g.:

$$X_1 = \{x_1, x_2\}, \quad X_2 = \{y_1, y_2\}$$

Therefore: 



$$C_{match} \not\subseteq C_{iter}, \qquad C_{iter} \not\subseteq C_{match} \qquad (4)$$



In all we have shown that Cconstr is the most expressive class and subsumes all 
the other classes. 
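
The two transformations used in the argument above (iterative to constrained, matching to constrained) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, labels are given as a dictionary, and constraints are stored as unordered pairs of distinct objects, so the reflexive pairs of $R_{ml}$ are left implicit.

```python
from itertools import combinations


def iterative_to_constrained(labels):
    """Turn the labeled part of an iterative problem into (R_ml, R_cl).

    labels: dict mapping each labeled object in Y to its class label.
    Each constraint is an unordered pair stored as a frozenset.
    """
    r_ml, r_cl = set(), set()
    for a, b in combinations(sorted(labels), 2):
        if labels[a] == labels[b]:
            r_ml.add(frozenset((a, b)))   # same label: must-link
        else:
            r_cl.add(frozenset((a, b)))   # different label: cannot-link
    return r_ml, r_cl


def matching_to_constrained(sources):
    """Turn a matching problem X_1..X_n into (X, R_ml, R_cl).

    Every source is assumed to be duplicate free, so all pairs of distinct
    objects inside one source become cannot-links; R_ml stays empty.
    """
    objects = set().union(*map(set, sources))
    r_cl = {frozenset(pair)
            for source in sources
            for pair in combinations(sorted(source), 2)}
    return objects, set(), r_cl
```

Objects that carry no label simply do not appear in any constraint, which is how the unlabeled part $X \setminus Y$ of an iterative problem is represented here.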



4 Method 

Object identification is generally done by three core components (Rendle and
Schmidt-Thieme (2006)):

1. Pairwise Feature Extraction with a function $f : X^2 \to \mathbb{R}^n$.

2. Probabilistic Pairwise Decision Model specifying probabilities for equivalences
$P[x \equiv y]$.

3. Collective Decision Model generating an equivalence relation $E$ over $X$.

The task of feature extraction is to generate a feature vector from the attribute
descriptions of any two objects. Mostly, heuristic similarity functions like
TFIDF-Cosine-Similarity or Levenshtein distance are used. The probabilistic pairwise
decision model combines several of these heuristic functions into a single domain-specific
similarity function (see Table 1). For this model probabilistic classifiers like SVMs,
decision trees, logic regression, etc. can be used. By combining many heuristic
functions over several attributes, no time-consuming function selection and fine-tuning
has to be performed by a domain expert. Instead, the model automatically learns
which similarity function is important for a specific problem. Cohen and Richman
(2002) as well as Bilenko and Mooney (2003) have shown that this approach is
successful. The collective decision model generates an equivalence relation over $X$ by
using $sim(x,y) := P[x \equiv y]$ as learned similarity measure. Often, clustering is used
for this task (e.g. Cohen and Richman (2002)).

Table 1. Example of feature extraction and prediction of pairwise equivalence
$P[x_i \equiv x_j]$ for three digital cameras.

Object   Brand             Product Name                         Price
x1       Hewlett Packard   Photosmart 435 Digital Camera        118.99
x2       HP                HP Photosmart 435 16MB memory        110.00
x3       Canon             Canon EOS 300D black 18-55 Camera    786.00

Object Pair   TFIDF-Cos. Sim.   FirstNumberEqual   Rel. Difference   Feature Vector     P[xi = xj]
              (Product Name)    (Product Name)     (Price)
(x1,x2)       0.6               1                  0.076             (0.6, 1, 0.076)    0.8
(x1,x3)       0.1               0                  0.849             (0.1, 0, 0.849)    0.2
(x2,x3)       0.0               0                  0.860             (0.0, 0, 0.860)    0.1
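
Two of the three features in Table 1 can be reproduced with a short sketch. The definitions below are assumptions inferred from the table values, not taken from the text: FirstNumberEqual is read as "the first run of digits in both product names is identical", and the relative price difference as the absolute difference divided by the larger price. The TFIDF cosine similarity is omitted because it depends on corpus statistics that the table does not give.

```python
import re


def _first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r"\d+", text)
    return match.group() if match else None


def first_number_equal(name_a, name_b):
    """1 if both names contain a number and their first numbers agree, else 0."""
    a, b = _first_number(name_a), _first_number(name_b)
    return int(a is not None and a == b)


def relative_difference(price_a, price_b):
    """Absolute price difference relative to the larger of the two prices."""
    return abs(price_a - price_b) / max(price_a, price_b)


offers = {
    "x1": ("Photosmart 435 Digital Camera", 118.99),
    "x2": ("HP Photosmart 435 16MB memory", 110.00),
    "x3": ("Canon EOS 300D black 18-55 Camera", 786.00),
}
```

With these definitions the price feature of the pair (x1, x2) evaluates to 8.99 / 118.99, which rounds to the 0.076 shown in the table.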



4.1 Collective decision model with constraints 

The constrained problem easily fits into the generic model above by extending the
collective decision model by constraints. As this stage might be solved by clustering
algorithms in the classical problem, we propose to solve the constrained problem by a
constraint-based clustering algorithm. To enforce constraint satisfaction we suggest
a constrained hierarchical agglomerative clustering (HAC) algorithm. Instead
of a dendrogram the algorithm builds a partition where each cluster should contain
equivalent objects. Because in an object identification task the number of equivalence
classes is almost never known, we suggest model selection by a (learned) threshold
$\theta$ on the similarity of two clusters in order to stop the merging process. A simplified
representation of our constrained HAC algorithm is shown in Algorithm 1. The algorithm
initially creates a new cluster for each object (line 2) and afterwards merges
clusters that contain objects constrained by a must-link (lines 3-7). Then the most similar
clusters that are not constrained by a cannot-link are merged until the threshold
$\theta$ is reached.

From a theoretical point of view this task might be solved by an arbitrary
probabilistic HAC algorithm using a special initialization of the similarity matrix $A$ and
minor changes in the update step of the matrix. For satisfaction of the constraints
$R_{ml}$ and $R_{cl}$, one initializes the similarity matrix for $X = \{x_1, \ldots, x_n\}$
in the following way:

$$A^0_{j,k} = \begin{cases} +\infty, & \text{if } (x_j,x_k) \in R_{ml} \\ -\infty, & \text{if } (x_j,x_k) \in R_{cl} \\ P[x_j \equiv x_k], & \text{otherwise} \end{cases}$$

As usual, in each iteration the two clusters with the highest similarity are merged.
After merging cluster $c_l$ with $c_m$ the dimension of the square matrix $A$ reduces by
one, both in columns and rows. For ensuring constraint satisfaction, the similarities
between $c_l \cup c_m$ and all the other clusters $c_i$ have to be recomputed (writing $n$
for the index of the merged cluster):

$$A^{t+1}_{n,i} = \begin{cases} +\infty, & \text{if } A^t_{l,i} = +\infty \vee A^t_{m,i} = +\infty \\ -\infty, & \text{if } A^t_{l,i} = -\infty \vee A^t_{m,i} = -\infty \\ sim(c_l \cup c_m, c_i), & \text{otherwise} \end{cases}$$



For calculating the similarity sim between clusters, standard linkage techniques 
like single-, complete- or average-linkage can be used. 
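
The matrix initialization and the update rule for a merged cluster can be written down directly. The following is a minimal sketch, assuming that constraints are stored as unordered pairs (frozensets) and that the matrix is kept as a dictionary; `init_matrix`, `merged_entry` and `prob` are hypothetical names introduced only for illustration.

```python
import math


def init_matrix(objects, prob, r_ml, r_cl):
    """Build A^0: +inf for must-links, -inf for cannot-links, else P[x_j = x_k]."""
    a = {}
    for j in objects:
        for k in objects:
            if j == k:
                continue
            pair = frozenset((j, k))
            if pair in r_ml:
                a[j, k] = math.inf
            elif pair in r_cl:
                a[j, k] = -math.inf
            else:
                a[j, k] = prob(j, k)
    return a


def merged_entry(a_li, a_mi, linkage_value):
    """Entry of the merged cluster towards cluster i, following the update rule.

    a_li, a_mi: old entries of the two merged clusters towards cluster i.
    linkage_value: sim(c_l u c_m, c_i) from the chosen linkage technique.
    """
    if math.inf in (a_li, a_mi):
        return math.inf        # a must-link is inherited by the merged cluster
    if -math.inf in (a_li, a_mi):
        return -math.inf       # a cannot-link is inherited as well
    return linkage_value
```

Because cannot-link entries stay at minus infinity after every merge, such a pair can never be the most similar admissible pair, which is what keeps the final partition consistent.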



Algorithm 1 Constrained HAC Algorithm
 1: procedure ClusterHAC(X, R_ml, R_cl)
 2:   P ← {{x} | x ∈ X}
 3:   for all (x,y) ∈ R_ml do
 4:     c_1 ← c where c ∈ P ∧ x ∈ c
 5:     c_2 ← c where c ∈ P ∧ y ∈ c
 6:     P ← (P \ {c_1,c_2}) ∪ {c_1 ∪ c_2}
 7:   end for
 8:   repeat
 9:     (c_1,c_2) ← argmax sim(c_1,c_2) over c_1,c_2 ∈ P ∧ (c_1 × c_2) ∩ R_cl = ∅
10:     if sim(c_1,c_2) ≥ θ then
11:       P ← (P \ {c_1,c_2}) ∪ {c_1 ∪ c_2}
12:     end if
13:   until sim(c_1,c_2) < θ
14:   return P
15: end procedure
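
Algorithm 1 translates almost line by line into Python. The version below is a sketch rather than the authors' implementation: it recomputes cluster similarities from the pairwise similarities in every round instead of maintaining a matrix, it additionally stops when no admissible pair is left, and the parameter names (`sim`, `theta`, `linkage`) are assumptions. Constraints are unordered pairs of distinct objects stored as frozensets.

```python
from itertools import combinations


def cluster_hac(objects, sim, r_ml, r_cl, theta, linkage=max):
    """Constrained HAC: returns a partition as a set of frozensets.

    sim(x, y): learned pairwise similarity P[x = y].
    linkage: max gives single-linkage, min gives complete-linkage.
    """
    partition = [frozenset([x]) for x in objects]          # line 2

    def merge(c1, c2):
        partition.remove(c1)
        partition.remove(c2)
        partition.append(c1 | c2)

    for pair in r_ml:                                      # lines 3-7
        x, y = tuple(pair)
        c1 = next(c for c in partition if x in c)
        c2 = next(c for c in partition if y in c)
        if c1 != c2:
            merge(c1, c2)

    def admissible(c1, c2):
        # No object of c1 may be cannot-linked to an object of c2.
        return not any(frozenset((x, y)) in r_cl for x in c1 for y in c2)

    def cluster_sim(c1, c2):
        return linkage(sim(x, y) for x in c1 for y in c2)

    while True:                                            # lines 8-13
        candidates = [(cluster_sim(c1, c2), c1, c2)
                      for c1, c2 in combinations(partition, 2)
                      if admissible(c1, c2)]
        if not candidates:
            break
        best, c1, c2 = max(candidates, key=lambda t: t[0])
        if best < theta:
            break
        merge(c1, c2)
    return set(partition)
```

Passing `statistics.mean` as `linkage` gives average-linkage in this formulation, since the mean over all cross pairs is exactly the average-linkage similarity.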



4.2 Algorithmic optimizations 



Real-world object identification problems often have a huge number of objects. An
implementation of the proposed constrained HAC algorithm has to consider several
optimization aspects. First of all, the cluster similarities should be computed by
dynamic programming: the similarities between clusters have to be computed from the
objects just once and afterwards can be inferred from the similarities that are already
given in the similarity matrix:

$$\begin{aligned}
sim_{sl}(c_1 \cup c_2, c_3) &= \max\{sim_{sl}(c_1,c_3),\ sim_{sl}(c_2,c_3)\} && \text{single-linkage} \\
sim_{cl}(c_1 \cup c_2, c_3) &= \min\{sim_{cl}(c_1,c_3),\ sim_{cl}(c_2,c_3)\} && \text{complete-linkage} \\
sim_{al}(c_1 \cup c_2, c_3) &= \frac{|c_1| \cdot sim_{al}(c_1,c_3) + |c_2| \cdot sim_{al}(c_2,c_3)}{|c_1| + |c_2|} && \text{average-linkage}
\end{aligned}$$
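
The three recurrences can be checked against a direct computation on a small example. The helper below is a sketch with a hypothetical name; it takes the two stored similarities towards $c_3$ and the sizes of the merged clusters.

```python
def merged_linkage(kind, s13, s23, n1, n2):
    """Similarity of c1 u c2 to c3 from the stored values sim(c1,c3), sim(c2,c3).

    n1, n2: sizes of c1 and c2 (only needed for average-linkage).
    """
    if kind == "single":
        return max(s13, s23)
    if kind == "complete":
        return min(s13, s23)
    if kind == "average":
        return (n1 * s13 + n2 * s23) / (n1 + n2)
    raise ValueError("unknown linkage: " + kind)


# Example: c1 = {a}, c2 = {b, c}, c3 = {d} with pairwise similarities to d.
to_d = {"a": 0.9, "b": 0.3, "c": 0.6}
s13 = to_d["a"]
s23_average = (to_d["b"] + to_d["c"]) / 2
merged_average = merged_linkage("average", s13, s23_average, 1, 2)
direct_average = sum(to_d.values()) / len(to_d)
```

In this example the recurrence and the direct mean over all three cross pairs agree up to floating-point rounding, so no pairwise similarity has to be touched again after the first pass.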



Second, a blocker should reduce the number of pairs that have to be taken into
account for merging. Blockers like the canopy blocker (McCallum et al. (2000))
reduce the number of pairs very efficiently, so even large data sets can be handled.
At last, pruning should be applied to eliminate cluster pairs with similarity below
$\theta_{prune}$. These optimizations can be implemented by storing a list of
cluster-distance pairs which is initialized with the pruned candidate pairs of the blocker.

Table 2. Comparison of F-Measure quality of a constrained to a classical method with different
linkage techniques. For each data set and each method the best linkage technique is marked
with an asterisk.

Data Set     Method                Single Linkage   Complete Linkage   Average Linkage
Cora         classic/constrained   0.70/0.92        0.74/0.71          0.89*/0.93*
DVD player   classic/constrained   0.87*/0.94       0.79/0.73          0.86/0.95*
Camera       classic/constrained   0.65/0.86*       0.60/0.45          0.67*/0.81



5 Evaluation 



In our evaluation study we examine whether additionally guiding the collective decision
model by constraints improves the quality. To this end we compare constrained and
unconstrained versions of the same object identification model on different data sets.
As data sets we use the bibliographic Cora dataset that is provided by McCallum et al.
(2000) and is widely used for evaluating object identification models (e.g. Cohen and
Richman (2002) and Bilenko and Mooney (2003)), and two product data sets of a price
comparison system.

We set up an iterative problem by labeling N% of the objects with their true class 
label. For feature extraction of the Cora model we use TFIDF-Cosine-Similarity, 
Levenshtein distance and Jaccard distance for every attribute. The model for the 
product datasets uses TFIDF-Cosine-Similarity, the difference between prices and 
some domain-specific comparison functions. The pairwise decision model is chosen 
to be a Support Vector Machine. In the collective decision model we run our con- 
strained HAC algorithm against an unconstrained (‘classic’) one. In each case, we 
run three different linkage methods: single-, complete- and average-linkage. We re- 
port the average F-Measure quality of four runs for each of the linkage techniques 
and for constrained and unconstrained clustering. The F-Measure quality is taken on 
all pairs that are unknown in advance - i.e. pairs that do not link two labeled objects. 



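$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{F-Measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$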



Table 2 shows the results of the first experiment where N = 25% of the objects
for Cora and N = 50% for the product datasets provide labels. As one can see, the
best constrained method always clearly outperforms the best classical method. When
switching from the best classical to the best constrained method, the relative error
reduces by 36% for Cora, 62% for DVD player and 58% for Camera. An informal
significance test shows that in this experiment the best constrained method is better
than the best classic one.

Fig. 2. F-Measure on unknown pairs on the Camera dataset for varying proportions of labeled
objects.
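
The quoted relative error reductions follow from the best values per data set in Table 2 if the error is taken to be 1 minus the F-Measure. That reading is an assumption, but it reproduces all three percentages. A short sketch, with hypothetical function names:

```python
def f_measure(tp, fp, fn):
    """F-Measure from pair counts: true positives, false positives, false negatives."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * recall * precision / (recall + precision)


def relative_error_reduction(f_classic, f_constrained):
    """Share of the classical error (1 - F) removed by the constrained method."""
    error_classic = 1 - f_classic
    error_constrained = 1 - f_constrained
    return (error_classic - error_constrained) / error_classic


# Best classical and best constrained F-Measure per data set (Table 2).
best = {"Cora": (0.89, 0.93), "DVD player": (0.87, 0.95), "Camera": (0.67, 0.86)}
reductions = {name: round(100 * relative_error_reduction(c, k))
              for name, (c, k) in best.items()}
```

For Cora, for example, the error drops from 0.11 to 0.07, i.e. by 0.04 / 0.11, which is about 36%.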

In a second experiment (see Figure 2) we increased the amount of labeled data
from N = 10% to N = 60% and report results for the Camera dataset for the best classical
method and the three constrained linkage techniques. The figure shows that the
best classical method does not improve much beyond 20% labeled data. In
contrast, when using the constrained single- or average-linkage technique the quality
on non-labeled parts always improves with more labeled data. When few constraints
are available average-linkage tends to be better than single-linkage, whereas
single-linkage is superior in the case of many constraints. The reason is the cannot-links,
which prevent single-linkage from merging false pairs. The bad performance of
constrained complete-linkage can be explained by must-link constraints that might result
in diverse clusters (Algorithm 1, lines 3-7). For any diverse cluster, complete-linkage
cannot find any cluster with similarity greater than $\theta$, and so after the initial step
diverse clusters are not merged any more (Algorithm 1, lines 8-13).



6 Conclusion 

We have formulated three problem classes that encode knowledge and restrict the
space of consistent solutions. For solving problems of the most expressive class
$C_{constr}$, which subsumes all the other classes, we have proposed a constrained object
identification model. To this end the generic object identification model was extended
in the collective decision stage to ensure constraint satisfaction. We proposed a HAC
algorithm with different linkage techniques that is guided by both a learned similarity
measure and constraints. Our evaluation has shown that this method with single-
or average-linkage is effective and that using constraints in the collective stage clearly
outperforms non-constrained state-of-the-art methods.



Abstract. In this work we apply several data mining techniques that give us deep insight
into knowledge extraction from a marketing survey addressed to the potential buyers of a
university gift shop. The techniques are classified as symmetrical and non-symmetrical. A
case for combining them is made in the conclusion.



1 Introduction 

When a large dataset is obtained from a survey including a large number of questions
it is necessary to extract the information and the relationships inherent to the data in
an ordered and effective way. The data is usually a mixture of subsets of quantitative,
categorical (closed questions) and frequency (open-ended) questions.

In this work we analyze data extracted from an on-line survey by means of dif- 
ferent and complementary methods divided in two categories: symmetrical and non- 
symmetrical. The former will be some factor method complemented with classifica- 
tion, whereas the latter will comprise some sort of regression models. After present- 
ing data and objectives (section 2) we outline methodology and results (section 3) 
and finally give some conclusions (section 4). 



2 Data and objectives 

The University of the Basque Country (UPV/EHU), as part of a large project whose
main aim is revamping its corporate image, is about to launch a corporate shop (also
considered as a gift or souvenir shop). In order to better know its potential buyers
and the potential success of the shop, it has set up an online survey to collect information
on its acceptability.



* Authors gratefully acknowledge financial support from Grupo de Investigacion
Consolidado DEC UPV/EHU GIU06/53.

This on-line survey is addressed to the members of the research and teaching
staff, administrative staff and the students of the university. Its main objectives are to
evaluate buying propensity about the corporate products, identify potential buyers’
and non-buyers’ profiles, know desirable characteristics of the products and obtain a
function to be named and considered as a “propensity to buy”.

Table 1 contains the sampling technical characteristics. Access to the survey was
possible only by invitation, and there was a period of one month for filling it in.
The number of invitations, or sample size, was fixed per stratum and chosen
in order to get a maximum error of 2% of the variability range of the responses for a
95% confidence level. The sampling was thus proportionally random and the results
were encouraging, with a global response rate of around 40%, though not equally
distributed.



Table 1. Technical characteristics of the on-line survey.

                   Students     Admin. Staff   Research & Teaching
Population         48995        1128           3982
Sample size        2289         768            1499
Response (%)       547 (23.9)   444 (57.81)    754 (50.30)
Sampling error     0.042        0.036          0.032
Confidence level   0.95         0.95           0.95
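
The text does not state how the sampling errors in Table 1 were computed. One standard formula that reproduces all three values is the margin of error for a proportion at maximum variance (p = 0.5) with a finite population correction, applied to the number of responses. The sketch below rests on that assumption; the function name and the z-value 1.96 for the 95% level are likewise not taken from the source.

```python
import math


def sampling_error(population, respondents, z=1.96, p=0.5):
    """Margin of error for a proportion with finite population correction."""
    fpc = math.sqrt((population - respondents) / (population - 1))
    return z * math.sqrt(p * (1 - p) / respondents) * fpc


# (population, number of responses) per stratum, as listed in Table 1.
strata = {
    "Students": (48995, 547),
    "Admin. Staff": (1128, 444),
    "Research & Teaching": (3982, 754),
}
errors = {name: round(sampling_error(n_pop, n_resp), 3)
          for name, (n_pop, n_resp) in strata.items()}
```

The correction matters most for the administrative staff, where the respondents make up a large share of a small population.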



The most relevant questions included in the survey were: a question on general
satisfaction about being a member of the university (5-point scale), a binary question
on general interest in buying the corporate articles, 26 questions on the valuation
(from 1 to 4) of the same number of products (shown in a photo), valuation (from 1
to 7) of 8 proposed desirable characteristics of products (sober, traditional, stylish,
modern, practical, artistic, daring and original) and personal information (gender,
age, post and campus, up to three possible). We were particularly interested in
getting information on preferences on the products, so we intentionally dropped the
middle point in product valuation questions. These questions are those which we
analyze by means of both non-symmetrical and symmetrical methods. We have made
this distinction in order to differentiate between methods that assume some sort of
causality or relationship direction in the variables (i.e., regression methods) and those
which do not (such as factor methods).



3 Methodology and results 

3.1 Symmetrical methods: Exploratory multivariate techniques 

Depending upon which kind of variables are to be considered as active we can
consider a Principal Components Analysis (PCA) or a Multiple Correspondence Analysis (MCA),
see e.g., Greenacre (1984), Lebart et al. (1984), Lebart (1994).

PCA of continuous variables and classification

We first consider as active variables the scores given to the question of the desirable
characteristics of products (original, sober, ...), which are measured on a 7-point
scale and may arguably be considered as near-continuous variables. The variables
regarding personal characteristics such as gender or age are considered as supplementary
variables, as well as the variables reflecting satisfaction with the institution and the
interest in buying.

The first factor is a size factor which distinguishes between persons who select 
higher scores for all or most such characteristics from those who select lower values. 
Those who give higher marks are also people who manifest a greater satisfaction, 
interest in buying and are over 44 years old. The positive side of the second factor 
corresponds to higher scores given to sober, traditional, stylish and artistic and to re- 
spondents over 44, teaching-research staff and men and the negative side corresponds 
to higher scores given to daring, original and modern. Finally, the third factor locates 
individuals scoring high the term practical, who are mostly students and under 30. 

A hierarchical clustering on the first 5 PCA axes, using the generalized Ward
criterion, results in three clusters. The first one (46%) corresponds
exactly to those on the positive side of the first factor (over 44, fully satisfied, with
buying interest, high scores to all characteristics). The second one (31%) corresponds to
individuals who rank high the characteristics of original, daring, modern and practical and
who are students, under 30, neither satisfied nor dissatisfied and who do not manifest
buying interest. This is a group who might be attracted to the first group, composed
of feasible buyers, by improving the characteristics of the products in the way they
consider important. The last cluster (23%) gives low scores to most of the characteristics,
manifests no interest in buying and is also indifferent to the institution. This
group seems a difficult one to reach.

This first analysis provides three main directions of variability by means of a 
PCA. The clustering over the main factors helps to group individuals into homoge- 
neous families where each cluster represents a market segment with different char- 
acteristics and reachable through different marketing strategies or perhaps products 
not considered here. 

MCA of categorical variables and classification 

As a second factor method, we choose the categorical variables referring to valuation
of the 26 articles (after seeing a displayed photo) on a scale 1-4 as the active variables
of an MCA. As supplementary variables we choose the product characteristics, the
satisfaction variable, the intention to buy and the individuals’ personal data.

Figure 1 shows the projection of the active categories on the MCA main plane. 
It shows how the first factor represents a global propensity to buy, roughly ordering 
categories from left to right with respect to their probability to buy, from lower to 
higher. The plane shows a typical Guttman effect with the second factor reflecting 
differences between extreme and centered opinions. 




186 Karmele Fernandez- Aguirre et al. 



[Figure: MCA main plane. Horizontal axis: Factor 1 (14.13 %); vertical axis: Factor 2
(6.67 %). The plotted points are the valuation categories 1 to 4 of the products, e.g.
T-shi=2, Tie=1, Cup=4.]

Fig. 1. MCA: active categories on plane (1,2).



With respect to the projections of the supplementary categories, it is shown in
Figure 2 that the first factor is positively related to the satisfaction with the institution
and the declared propensity to buy. This shows the relationship of these variables
with the overall propensity to buy individually the 26 products.





[Figure: MCA main plane. Horizontal axis: Factor 1 (14.13 %); vertical axis: Factor 2
(6.67 %). The plotted points are the supplementary categories Satis=1 to Satis=5 and
BuyLo=1, BuyLo=2.]

Fig. 2. MCA: supplementary categories on plane (1,2).



A mixed classification in three steps is carried out on the first 8 MCA principal axes.
This process starts by choosing a partition in 10 clusters with random initial centers
and then updating those centers by calculating the centroids of the groups of individuals
nearest to the centers (k-means algorithm); the process is repeated until the clusters
are stable. We further reduce the number of clusters by means of a hierarchical
algorithm (generalized Ward’s method) and refine the resulting partition with a
consolidation step with re-assignment (moving centers, with convergence achieved
in 7 iterations). This results in a partition of 6 classes with an inter inertia over total
inertia ratio of 55.62%. The positions of the final centers on the plane are given in
Figure 3, and follow the pattern set by the active categories on this same plane.
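
The three steps (k-means from random centers, Ward aggregation of the resulting groups, consolidation by re-assignment) can be sketched in plain Python. This is an illustration under stated assumptions, not the software used for the survey: it works on points given as coordinate tuples, uses squared Euclidean distance and the ordinary Ward criterion, and all function names are hypothetical.

```python
import random


def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, max_iter=100):
    """Moving-centers iterations from the given centers; returns the groups."""
    centers = list(centers)
    groups = []
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sqdist(p, centers[i]))
            groups[nearest].append(p)
        groups = [g for g in groups if g]            # drop empty groups
        new_centers = [centroid(g) for g in groups]
        if new_centers == centers:                   # stable partition
            break
        centers = new_centers
    return groups


def ward_merge(groups, n_clusters):
    """Agglomerate groups, always merging the pair with the smallest Ward cost."""
    groups = [list(g) for g in groups]

    def cost(i, j):
        a, b = groups[i], groups[j]
        weight = len(a) * len(b) / (len(a) + len(b))
        return weight * sqdist(centroid(a), centroid(b))

    while len(groups) > n_clusters:
        pairs = [(i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))]
        i, j = min(pairs, key=lambda ij: cost(*ij))
        absorbed = groups.pop(j)                     # j > i, so index i stays valid
        groups[i].extend(absorbed)
    return groups


def mixed_classification(points, n_initial, n_final, seed=0):
    rng = random.Random(seed)
    groups = kmeans(points, rng.sample(points, n_initial))    # step 1
    groups = ward_merge(groups, n_final)                      # step 2
    return kmeans(points, [centroid(g) for g in groups])      # step 3: consolidation
```

In the setting of the text this would be called with the individuals' coordinates on the first 8 MCA axes, `n_initial=10` and `n_final=6`.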



[Figure: centers of clusters 1/6 to 6/6 on the MCA plane; horizontal axis: Factor 1
(14.13 %).]

Fig. 3. Classification on MCA factors. Cluster centers and relative sizes represented by circle
diameters.



The partition description is as follows. Cluster 1 (15.73%) contains those who
had a prior intention to buy, say they are very likely to buy many of the products, are
over 44, fully satisfied, female, members of the teaching and research staff, and give high
scores to stylish and traditional. Cluster 2 (17.91%) is formed by those who are likely to
buy, are over 44, had a prior intention to buy and rank stylish, traditional and sober
highly. Cluster 3 (17.74%) is dominated by those who say they are unlikely to buy sober and
stylish (metallic) products but likely to buy original, modern and practical products
(textiles and bags). Cluster 4 (12.80%) groups individuals unlikely to buy anything, with
low scores for stylish products. Cluster 5 (18.66%) is composed of individuals very unlikely
to buy, aged between 18 and 22, students, from the Gipuzkoa campus, neither satisfied nor
dissatisfied and with low scores on traditional, sober or stylish. Finally, in cluster 6
(17.16%) are those who are very unlikely to buy, between 30 and 44, male and with
low marks for all characteristics of the products.

This MCA confirms the tight relationship between the interest in buying articles
featuring the logo (declared before seeing the products), the degree of satisfaction with
the institution and the scores given to the proposed desirable characteristics of the
products. The clustering process shows marketing implications on the buyers’ and
non-buyers’ personal characteristics and on which articles are perceived as stylish,
traditional and sober and which ones as modern, original and practical. Furthermore, the
parabolic path appearing in Figure 1 is similar to those shown in Figures 2 and 3, reinforcing
its interpretation as an indicator of the propensity to buy the displayed products.

3.2 Non-symmetrical methods: regression related techniques

In this section we consider methods where one variable is chosen to be dependent on
others. In this work, the variable of interest is the probability, or propensity, to buy
and is exactly our choice for the endogenous variable.

PLS path modelling 

PLS path modelling (see, e.g., Tenenhaus et al. (2005)) is a technique based on the re- 
lationships between latent variables in a regression framework where such variables 
are constructed with underlying manifest variables (MV). In this case, the variables 
are those obtained with the questions of the survey. 

We are going to construct a global propensity to buy using all manifest variables,
resulting in a global latent variable (LV). At the same time, we want unidimensional
partial propensities to buy groups of products, with the groups selected by the
data themselves; we do not want to impose any additional structure other than that imposed
by the model itself. These will also have the form of LVs and will be sought with a
previous PCA of the valuations of all the 26 products displayed in the survey.

Table 2 contains the 8 groups of products formed in the way explained above. 
These groupings originate directly 8 partial LVs, using mode B. 



Table 2. Groups of products to be considered as LV.

label      LV   products
umbh       ξ1   umbrella, hat
tie        ξ2   tie, kerchief no.1, kerchief no.2
textiles   ξ3   T-shirt, T-shirt-V, sweater, cap
bag        ξ4   plastic tray, leather tray, backpack, bag, cup
wat        ξ5   leather-strapped watch, metallic-strapped watch, wallet
mous       ξ6   keyring, lighter, mousepad
scul       ξ7   pin, sculpture
pens       ξ8   blue pen, black pen, silver pen, silver pen in wooden case



Selecting all product valuations, we construct the global propensity to buy using
mode A. Finally, we formulate the external model ξ = Σ_j β_j ξ_j (j = 1, ..., 8).

Figure 4 shows the path model specified. The numbers are correlations and show 
relatively high values between the partial LVs and the global one. We can also see 
the pairwise correlations between individual MVs and the LVs. 

The actual estimates of the external model parameters are given in equation (1). 
These show higher values for textiles, bags and pens products groups, which are 
those with a higher acceptability among the respondents. 
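
As a small illustration of how the estimates of equation (1) are used, the sketch below evaluates the weighted sum of the eight partial propensities and ranks the product groups by coefficient. The coefficients are those reported in equation (1); the function name and the idea of passing the partial LV scores as a dictionary are assumptions made for the example.

```python
# Estimated coefficients of the eight partial latent variables, equation (1).
WEIGHTS = {
    "umbh": 0.0865, "tie": 0.1335, "textiles": 0.2041, "bag": 0.2114,
    "wat": 0.1791, "mous": 0.1292, "scul": 0.0881, "pens": 0.2322,
}


def global_propensity(partial_scores):
    """Expected global propensity to buy for one respondent.

    partial_scores: dict with one partial LV score per product group label.
    """
    return sum(WEIGHTS[label] * partial_scores[label] for label in WEIGHTS)


# Product groups ordered by the size of their coefficient.
ranking = sorted(WEIGHTS, key=WEIGHTS.get, reverse=True)
```

The first three entries of `ranking` are pens, bag and textiles, in line with the groups singled out in the text.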



E(ξ) = 0.0865 * umbh + 0.1335 * tie + 0.2041 * textiles + 0.2114 * bag
       + 0.1791 * wat + 0.1292 * mous + 0.0881 * scul + 0.2322 * pens        (1)

Fig. 4. PLS path diagram for products to be sold at the university shop.



In order to get a potential buyers’ characterization (similar to the projection of 
supplementary variables in a factor analysis), we perform a regression on the de- 
sirable characteristics of the products and the respondents’ personal characteristics. 
This is actually a Principal Components Regression (PCR), since the desirable char- 
acteristics are highly correlated, selecting 2 main components out of the 7 original 
variables. 



E(ξ) = -0.85 + 0.07 * F1 (orig., daring, practical, artistic, modern)
       + 0.11 * F2 (traditional, sober, stylish) - 0.25 * male
       + 0.15 * satisfied + 0.26 * very satisfied + 0.07 * age(+44)
       + 0.06 * teaching-research staff - 0.10 * higher education
       + 1.18 * overall propensity to buy a logo product
       + 0.14 * campus: Araba + 0.12 * campus: Bizkaia

R² = 0.4848

All parameters whose estimates are shown are significant at the 5% level, both
using bootstrap confidence intervals and usual t-test statistics. These estimates show
that the individuals most satisfied with the university are more likely to buy, along
with women. The same holds for those who have a prior intention to buy, members of the
teaching and research staff, older respondents and those from the campuses of
Bizkaia and Araba compared with those from Gipuzkoa. With respect to product
characteristics, those marking as more important the terms traditional, sober and stylish are
more likely to buy than individuals giving more importance to aspects such as modern,
practical and so on.

Logit models

Finally, we have calculated a logit regression (see, e.g., Hosmer and Lemeshow
(2000)) on individuals’ personal characteristics, product characteristics and the
satisfaction variable, where the dichotomous endogenous variable is the response (yes
or no) to the question of whether the respondent would, in general, buy university corporate
products. This is a prior probability in the sense that individuals had to respond to
that question before actually seeing the products.

We have also considered the construction of a posterior probability to buy and 
then estimated another logit model with this probability as the endogenous variable. 
An individual is considered likely to buy a product if he or she scores 3 (likely) or 
4 (very likely) for that product. In the same way, an individual is considered to buy 
articles if he or she would likely buy more than 25% of all articles (at least 7 
articles). 

As in the PLS path model case, the desirable characteristics of the products are 
highly correlated and we have substituted them by two principal PCA factors (after 
performing a Varimax rotation). 

We end up with the following two model estimates: 

1. Prior probability model estimates (Nagelkerke $R^2 = 0.140$): 

$$\begin{aligned}
X'\beta = {} & -0.510 + 0.267\,\text{teach./res.} + 0.307\,\text{Bizkaia} + 0.398\,\text{age over 44} \\
& + 0.797\,\text{satisfied} + 1.160\,\text{very satisfied} \\
& + 0.220\,F1\ (\text{innovative+practical}) + 0.272\,F2\ (\text{classic})
\end{aligned}$$

2. Posterior estimates (Nagelkerke $R^2 = 0.502$): 

$$\begin{aligned}
Z'\beta = {} & -1.298 + 0.537\,\text{student} + 0.584\,\text{teach./res.} - 0.794\,\text{male} \\
& + 0.367\,\text{satisfied} + 0.710\,\text{very satisfied} + 0.339\,F2\ (\text{classic}) \\
& + 2.979\,\text{buying initial interest}
\end{aligned}$$

The prior probability model yields results very similar to those from the PLS path 
model and the factor analyses performed in the previous subsection. The posterior 
probability model fits better but yields less similar results, which may be due to 
the particular construction of the endogenous variable. That construction is 
reasonable but also subjective, and it can only be considered as an aid to better 
understanding the structure of the data. 
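
To make the logit estimates easier to read, the following minimal Python sketch turns the prior probability model into a predicted probability through the logistic function. The coefficients are those reported above; the dictionary keys and the function name are hypothetical labels, and the two factor scores are assumed to be zero unless given.

```python
import math

# Coefficients of the prior-probability logit model as reported in the text.
# The key names are hypothetical labels chosen for this illustration.
PRIOR_COEF = {
    "intercept": -0.510,
    "teach_res": 0.267,
    "bizkaia": 0.307,
    "age_over_44": 0.398,
    "satisfied": 0.797,
    "very_satisfied": 1.160,
    "F1": 0.220,
    "F2": 0.272,
}


def prior_probability(profile):
    """Return the predicted probability for a profile {variable: value}.

    Variables missing from the profile are treated as 0 (reference level
    or average factor score), which is an assumption of this sketch.
    """
    eta = PRIOR_COEF["intercept"]
    for name, value in profile.items():
        eta += PRIOR_COEF[name] * value
    return 1.0 / (1.0 + math.exp(-eta))


# A very satisfied member of the teaching/research staff versus the baseline.
p_staff = prior_probability({"teach_res": 1, "very_satisfied": 1})
p_base = prior_probability({})
print(round(p_staff, 3), round(p_base, 3))
```

The comparison of the two profiles only illustrates how the signs and sizes of the coefficients translate into probabilities; it does not reproduce any result of the study.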



4 Conclusions 

Each technique used leads to specific, though related, conclusions given its 
different objectives. The symmetrical methods (PCA, MCA) combined with Cluster 
Analysis help to learn what is contained in the data, including relationships and 
classifications of similar individuals. On the other hand, non-symmetrical methods 
such as PLS or Logit regressions allow for modelling individuals’ global and partial 
(group) behaviour, using inference tools to select a better model with a good fit to 
the data. 

The methods presented above consistently extract some facts from this particular 
data set. The general characteristics of the gift shop's potential buyers become 
clear (satisfied with the institution, members of the teaching-research staff, 
women...). At the same time, the general characteristics of the articles shown are 
also clear (traditional, ...), as are the characteristics of possibly successful 
articles not covered by the current product line (practical, original or modern). It 
seems that a better, more modern design is needed to reach other market segments. 

The marketing implications obtained have been somewhat conditioned upon the 
actual articles displayed with photographs in the on-line questionnaire. It has been 
observed that many have been perceived as stylish and traditional (generally of a 
metallic aspect) and of little appeal for the young. As a general issue, this work rec- 
ommends the promotion of articles with the characteristics mentioned above and, 
particularly, belonging to the groups of textiles, bags and desktop articles which 
would yield a better acceptance for this target public in the opening university gift 
shop. 

All in all, it can be said that these data mining techniques yield useful directions 
for the university marketing policy regarding the corporate shop. The combination 
of techniques, though never fully exhaustive, reinforces the confidence in the 
results, as it is unlikely that important patterns in the data have been missed. 



Abstract. Many problems in industrial quality control involve n measurements on p process 
variables $X_{n,p}$. Generally, we need to know how the quality characteristics of a product 
behave as process variables change. Nevertheless, there may be two problems: the linear 
hypothesis is not always respected, and the q quality variables are not measured frequently 
because of high costs. A B-spline transformation removes the linearity hypothesis, while 
principal component analysis with linear constraints (CPCA) projects onto the subspace spanned 
by the columns of the X matrix. Linking the $Y_{n,q}$ and $X_{n,p}$ variables gives us 
information on $Y_{n,q}$ without expensive measurements and off-line analysis. Finally, there 
are few uncorrelated latent variables which contain the information about $Y_{n,q}$ and may be 
monitored by multivariate control charts. The purpose of this paper is to show how the joint 
use of different statistical methods, such as B-splines, Constrained PCA and multivariate 
control charts, allows a better control of product or service quality by directly monitoring 
the process variables. The proposed approach is illustrated by the discussion of a real 
problem in an industrial process. 



1 Introduction 

Firms frequently have to decide how to select the process parameters which most 
influence the quality characteristics of a product. The selection of the "optimal" 
combination of parameters and the choice of statistical methods to solve this problem 
may not be a simple matter. In this paper, we propose some statistical techniques 
to determine the "best" technology for pasta production. 

Quality characteristics of pasta, tested in the laboratory, can be divided into two 
clusters: "colour-appeal" and "taste". Customers prefer clear and amber pasta without 
red veins. Besides, the pasta must remain "al dente" in case of overcooking or 
undercooking (Abecassis et al., 1992). 

In this paper, we suggest a nonlinear approach to select the "best" technology for 
the production of pasta (spaghetti of about 0.04 in diameter) and to choose which 
process parameters to monitor. 

In the first step, we define the different settings of the manufacturing process 
which can be used. To obtain an optimal setting, it is necessary to consider three 
process parameters: temperature (T), drying time (DT) and damp (D). Forty-five tests 
were run with different combinations of process parameters. At the same 
time, quality characteristics were measured by six variables: viscosity on a 1-9 
category scale (F), judgement on taste in case of overcooking (N1) and undercooking 
(N2) on a 0-9 category scale, and homogeneity of red (A), yellow (B) and brown (100-L). In 
the second step, we define the relation between the response variables ($Y_{45,6}$) and 
the process variables ($X_{45,7}$) by means of multivariate statistical methods such as 
Constrained Principal Component Analysis (CPCA - D’Ambra and Lauro, 1982). In the 
third step, since the CPCA shows a horseshoe effect in the data set, we propose a 
B-spline transformation of the data before interpreting the results. In the last step, we 
define, by means of Shewhart charts, the "optimal" combination of process parameters 
which produces the "best" pasta. 

The use of traditional control charts to monitor the process variables instead of 
the response ones is a good solution for many reasons. First, the process variables 
are measured much more frequently, usually in the order of seconds or minutes as 
compared to hours for the response variables. Second, process variables are generally 
measured in a more precise way than response variables. Third, CPCA components 
are always independent even when single variables are correlated. 

The aim of this paper is to show how CPCA methods can be used in the case of 
nonlinear data and how techniques like Multivariate Principal Component Charts 
(MPCC - MacGregor and Kourti, 1995) can aid in the interpretation of results. The 
paper is organised as follows. In Section 2 the CPCA method is applied to the 
pasta data. A horseshoe effect is present in the raw data. Different approaches to 
solve this problem are given in Section 3; in particular, a B-spline transformation of 
the X data is applied. In Section 4 the results of CPCA on the B-spline transformed 
data are tested by a stability analysis. A first interpretation of the CPCA results is 
given in Section 5. 



2 Constrained principal component analysis 

Let $X_{n,p}$ and $Y_{n,q}$ be the raw data matrices associated with two sets of quantitative 
variables observed on the same experimental units. Furthermore, let Q and D be 
symmetric and positive definite matrices of order q and n respectively. In the 
remainder of the paper, we will consider X and Y as standardised data matrices, hence 
$Q = I$. The aim of CPCA (D’Ambra and Lauro, 1982) is to analyse the structure of the 
explained variability of the Y data set given the process variables X. Let 

$$P_X = X(X'DX)^{-1}X'D \qquad (1)$$

be the D-orthogonal projector onto the space spanned by the columns of X. CPCA 
consists in carrying out a PCA on the matrix 

$$\hat{Y} = P_X Y \qquad (2)$$

Figure 1 shows a scatter plot of the first two principal components of the 
statistical study $(\hat{Y}, D, I)$. They explain nearly all the data variability (87.80%), 
but in this representation the second axis is an arched function of the first axis. 
CPCA creates a serious artifact called the horseshoe effect. This is a problem because 
CPCA performs better when the 45 experimental tests have monotonic distributions along 
gradients (i.e. either increase or decrease, but not both). To resolve the horseshoe 
problem and obtain more interpretable results, a nonlinear transformation of the data 
can be used (Gifi, 1990). 
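
The construction in equations (1) and (2) can be sketched numerically as follows. This is a minimal illustration in Python with NumPy, assuming uniform weights D = I/n and the identity metric; the function name `cpca` and the simulated data are hypothetical and are not the pasta data of the paper.

```python
import numpy as np


def cpca(X, Y, weights=None):
    """PCA of the D-orthogonal projection of Y onto the column space of X."""
    n = X.shape[0]
    d = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, float)
    D = np.diag(d)
    # D-orthogonal projector onto span(X); pinv guards against collinearity.
    P = X @ np.linalg.pinv(X.T @ D @ X) @ X.T @ D
    Y_hat = P @ Y
    # PCA of the study (Y_hat, D, I): eigen-decomposition of Y_hat' D Y_hat.
    eigval, eigvec = np.linalg.eigh(Y_hat.T @ D @ Y_hat)
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = Y_hat @ eigvec
    return P, Y_hat, eigval, scores


# Simulated example: 45 runs, 3 process variables, 6 quality variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(45, 3))
Y = X @ rng.normal(size=(3, 6)) + 0.1 * rng.normal(size=(45, 6))
X = (X - X.mean(0)) / X.std(0)   # standardise both blocks
Y = (Y - Y.mean(0)) / Y.std(0)

P, Y_hat, eigval, scores = cpca(X, Y)
explained = eigval / eigval.sum()  # share of projected variability per axis
print(np.round(explained[:2], 3))
```

Because the projected matrix has rank at most the number of columns of X, at most that many eigenvalues are non-zero; plotting the first two columns of `scores` gives the kind of display shown in Figure 1.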




Fig. 1. Plot of the first and second Constrained Principal Component. 



3 Nonlinear Constrained Principal Component Analysis 

The B-spline approach (Durand, 1993) allows greater flexibility in the adjustment of 
the dependence between the X and Y sets of variables. 

Let $S_j(x_j)B_j$ be the transformation of the column $x_j$, $j = 1, \dots, p$, where 
$S_j(n,k)$ is the B-spline basis with a priori fixed order and knots (De Boor, 1978; 
Eubank, 1988) and $B_j(k,q)$ is the matrix of coefficients. 

We can then write S as: 

$$S = [\,S_1 \mid \cdots \mid S_p\,] \qquad (3)$$

and B as: 

$$B = \begin{bmatrix} B_1 \\ \vdots \\ B_p \end{bmatrix} \qquad (4)$$



Consider the following multivariate model 

$$X = SB + E \qquad (5)$$

In order to estimate B we minimise the trace of $E'E$, that is 

$$\min_{B} \|X - SB\|^2 \qquad (6)$$

Consider the class of reduced-rank regression models for the multivariate linear model 
with $\operatorname{rank}(B) = r \le q$ (Izenman, 1975). Under this condition there exist 
two (non-unique) matrices $A_r$ and $G_r$, both of rank r, such that $B_r = A_r G_r$. So 
we have to minimise 

$$\min_{A_r, G_r} \|X - S A_r G_r\|^2 \qquad (7)$$

The solutions for the minimisation of (7) are given by $G_r = [v_1 \dots v_r]'$ and 
$A_r = (S'S)^{-1}S'X[v_1 \dots v_r]$, where $v_k$ is the eigenvector corresponding to the 
kth largest eigenvalue of $Y'S(S'S)^{-1}S'Y$ (Izenman, 1975). 

The regression coefficient matrix of rank r is therefore given by 

$$B_r = (S'S)^{-1}S'X \Big[\sum_{k=1}^{r} v_k v_k'\Big] \qquad (8)$$

This solution is linked to an extension of CPCA, called CPCC-additive spline, 
which carries out a PCA of the image of the $y_j$ onto the B-spline basis with knots 
chosen in the range of each $y_j$, $j = 1, \dots, q$. In this case we have 
$Y^* = P_S Y = S(S'S)^{-1}S'Y$, and we carry out a PCA of the statistical study $(Y^*, D, I)$. 
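
As an illustration of the projection $Y^* = S(S'S)^{-1}S'Y$, the sketch below builds a B-spline basis for each column of X with the Cox-de Boor recursion, stacks the bases side by side, and projects Y onto their span. It is a minimal Python/NumPy sketch under stated assumptions: knots are placed at quantiles of each column, the pseudo-inverse replaces $(S'S)^{-1}$ because the stacked bases are collinear, and the function names and simulated data are hypothetical.

```python
import numpy as np


def bspline_basis(x, interior_knots, degree):
    """B-spline basis evaluated at x (Cox-de Boor recursion, clamped knots)."""
    x = np.asarray(x, float)
    lo, hi = x.min(), x.max()
    t = np.concatenate([[lo] * (degree + 1), interior_knots, [hi] * (degree + 1)])
    n_int = len(t) - 1
    # Degree 0: indicator functions of the half-open knot intervals.
    B = np.zeros((x.size, n_int))
    for i in range(n_int):
        B[:, i] = (t[i] <= x) & (x < t[i + 1])
    # Points equal to the upper bound go into the last non-empty interval.
    last = np.max(np.nonzero(t[:-1] < t[1:])[0])
    B[x == hi, last] = 1.0
    for d in range(1, degree + 1):
        B_new = np.zeros((x.size, n_int - d))
        for i in range(n_int - d):
            left_den = t[i + d] - t[i]
            right_den = t[i + d + 1] - t[i + 1]
            if left_den > 0:
                B_new[:, i] += (x - t[i]) / left_den * B[:, i]
            if right_den > 0:
                B_new[:, i] += (t[i + d + 1] - x) / right_den * B[:, i + 1]
        B = B_new
    return B  # degree + 1 + len(interior_knots) columns


def spline_projection(X, Y, n_knots=1, degree=1):
    """Stack one spline basis per column of X and project Y onto their span."""
    blocks = []
    for j in range(X.shape[1]):
        xj = X[:, j]
        # Assumption of this sketch: interior knots at equally spaced quantiles.
        knots = np.quantile(xj, np.linspace(0, 1, n_knots + 2)[1:-1])
        blocks.append(bspline_basis(xj, knots, degree))
    S = np.hstack(blocks)
    Y_star = S @ np.linalg.pinv(S) @ Y
    return S, Y_star


rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(45, 3))
Y = np.column_stack([X[:, 0] ** 2, np.abs(X[:, 1]), X[:, 2]])
Y = Y + 0.05 * rng.normal(size=(45, 3))
S, Y_star = spline_projection(X, Y, n_knots=1, degree=1)
print(S.shape, Y_star.shape)
```

A PCA of `Y_star` then plays the role of the PCA of the study $(Y^*, D, I)$; the degree and the number of knots control how much nonlinearity the projection can absorb.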

A second approach (Durand, 1993) searches for a matrix transformation C of X and 
a matrix R that minimise the distance between the scalar product operators $YY'D$ and 
$CRC'D$: 

$$\min_{C,R} \|YY'D - CRC'D\|^2 \qquad (9)$$

with $C = SB$, where S is the B-spline matrix with a priori fixed order and 
knots. The minimum can be attained by an approximate solution based on an 
alternating iterative procedure. 

A more recent method is the two-stage approach to engine mapping, which uses 
B-spline basis functions at the second stage to describe the effects of one or more 
factors (splined factors) and low-order monomials to represent the main effects and 
interactions of the remaining (nonsplined) factors (Grove et al., 2004). 

In this paper we have used the first approach. The first principal component of 
the statistical study $(Y^*, D, I)$ explains 81.10% of the total variation of the matrix Y. 
The first two principal components explain 96.80% of the total variance. 
A stability analysis can be performed to evaluate the goodness of the results. 



4 Stability analysis 

Daudin (1988) suggests studying stability by bootstrap. The basic idea of the 
bootstrap is to generate many new matrices starting from the raw data. Each new matrix 
is obtained by sampling the original rows at random with replacement. Applying the 
bootstrap to $Y^*$, we generate m new matrices ${}^lY^*$, where $l = 1, \dots, m$. Let 
$\lambda_i$ and $u_i$ be the ith eigenvalue and the associated eigenvector of the 
correlation matrix of $Y^*$, and ${}^lu_f$ the fth eigenvector of the correlation matrix 
of ${}^lY^*$. Furthermore let 

$$\theta_{if} = \cos(u_i, {}^lu_f) \qquad (10)$$

and 

V = (11) 

where $i, f = 1, \dots, k$ and k is the number of examined eigenvalues. 

Plotting these quantities for the first two eigenvectors shows that their 
orientation seems to be stable (Figure 2). In fact, it is not considerably modified 
over the 250 replications. 




Fig. 2. The stability representation for the first two eigenvectors. 



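The stability of the components can be confirmed by the following quantity 

$$\mathrm{MSE}(k) = \frac{1}{m} \sum_{l=1}^{m} ({}^lV - k)^2 \qquad (12)$$

where ${}^lV$ is the value of V for the lth bootstrap replication. If MSE(k) is close 
to zero, the examined k components are stable. Here, for k = 2 and m = 250, the 
result is 0.000084. 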
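
The bootstrap procedure of this section can be sketched as follows in Python with NumPy. The cosines between original and bootstrap eigenvectors follow the description above; the summary V used here, the sum of squared cosines over the first k eigenvectors (which equals k when the two subspaces coincide), is an assumption of this sketch and not necessarily the definition used in the paper. Function names and simulated data are hypothetical.

```python
import numpy as np


def leading_eigvecs(M, k):
    """First k eigenvectors (as columns) of the correlation matrix of M."""
    R = np.corrcoef(M, rowvar=False)
    val, vec = np.linalg.eigh(R)
    order = np.argsort(val)[::-1][:k]
    return vec[:, order]


def bootstrap_stability(Y_star, k=2, m=250, seed=0):
    rng = np.random.default_rng(seed)
    n = Y_star.shape[0]
    U = leading_eigvecs(Y_star, k)
    cos = np.empty((m, k, k))
    for l in range(m):
        rows = rng.integers(0, n, size=n)       # sample rows with replacement
        U_l = leading_eigvecs(Y_star[rows], k)
        cos[l] = U.T @ U_l                      # cos[l, i, f] between u_i and bootstrap u_f
    # ASSUMPTION: V is taken as the sum of squared cosines (equals k if stable).
    V = (cos ** 2).sum(axis=(1, 2))
    mse = np.mean((V - k) ** 2)
    return cos, V, mse


# Simulated data with two well-separated latent directions.
rng = np.random.default_rng(4)
F = rng.normal(size=(45, 2)) * np.array([3.0, 1.0])
L = np.array([[1, 1, 1, 1, 1, 1], [1, 1, 1, -1, -1, -1]], dtype=float)
Y_star = F @ L + 0.2 * rng.normal(size=(45, 6))

cos, V, mse = bootstrap_stability(Y_star, k=2, m=250)
print(mse)
```

Eigenvectors are defined only up to sign, so squared or absolute cosines are the quantities to inspect; a value of `mse` close to zero indicates that the subspace spanned by the first k eigenvectors barely moves across replications.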



5 Results and interpretation 

Figure 3 shows the representation of the 45 samples on the first two principal 
components, where the percentage of the total variability explained is 96.80. This 
percentage allows a good description of the data structure, and the stability analysis 
indicates that this data structure can be considered stable. Furthermore, the dotted 
lines show that the B-spline transformation has smoothed the raw data, and the problem 
of nonlinearity seems to be eliminated. 




Fig. 3. Plot of the first and second Nonlinear Constrained Principal Component; the points are 
the 45 different tests and the vectors are variables: temperature (T); drying time (DT); damp 
(D); interaction between temperature, drying time and damp (T*DT*D); temperature and drying 
time (T*DT); temperature and damp (T*D); drying time and damp (DT*D); viscosity (F); 
judgement on taste in case of overcooking (N1) and undercooking (N2); homogeneity of red 
(A), yellow (B), brown (100-L). 



The first axis of the representation could be called "taste", as it is characterised 
by contributions of viscosity and judgement on taste together with contributions of red 
and brown colour. All these variables are positively correlated with "taste" (about 
0.97). The second axis could be called "colour-appeal", as the yellow colour 
contributes 98% to this axis. The process variables which have a positive influence 
on "taste" are temperature, drying time, their interaction and the interaction between 
drying time and damp. On the contrary, all the other variables have a negative 
influence. The second axis is characterised mostly by the fact that drying time and 
damp lie on opposite sides; this contrast influences the homogeneity of yellow. 

The PCA of the statistical study $(Y^*, D, I)$ indicates only which process variables 
influence the quality characteristics of products. The direction in which to look for 
the best combination of process parameters (Abecassis et al., 1992) is along the 
diagonal D-DT (Figure 3). This information may not be sufficiently clear, because along 
the diagonal D-DT there are many different combinations of process parameters. In 
this case, a graphic display such as Shewhart charts could give some information 
about the optimal combination to choose for the production of pasta. 

The scores can be projected onto Shewhart charts where the Central Line (CL), 
Upper Decision Line (UDL) and Lower Decision Line (LDL) are 0, 0+3 and 0-3 
respectively. In this paper, these "Multivariate Principal Component Charts" (MPCC) 
are used for the first principal component (Figure 4.a), the second one (Figure 
4.b) or both, according to marketing decisions, that is, to maximise "taste" or 
"colour-appeal" or to choose the optimal mix of "taste" and "colour-appeal". 
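
A chart of this kind can be sketched in a few lines of Python with NumPy. The sketch assumes that the component scores are standardised to zero mean and unit variance before being compared with the decision lines at plus and minus 3; the function name and the example scores are hypothetical.

```python
import numpy as np


def shewhart_flags(scores, width=3.0):
    """Standardise scores and return indices above UDL and below LDL."""
    z = (scores - scores.mean()) / scores.std(ddof=1)
    above = np.flatnonzero(z > width)    # beyond the Upper Decision Line
    below = np.flatnonzero(z < -width)   # beyond the Lower Decision Line
    return z, above, below


# Hypothetical scores of 45 tests on one component, with one extreme test.
scores = np.tile([-1.0, 1.0], 23)[:45]
scores[39] = 10.0

z, above, below = shewhart_flags(scores)
print(above, below)
```

Indices are zero-based here, so position 39 corresponds to the fortieth test in the sequence; in practice the flagged tests are then traced back to their combinations of process parameters.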




Fig. 4. MPCC for the first (a) and the second (b) Nonlinear Constrained Principal Component. 



In Figure 4.a, the experimental tests 34, 38, 39, 40, 44 and 45 suggest that the 
temperature must be higher than 100°C to give the best value for "taste". Figure 4.b 
shows that the best value for "colour-appeal" is obtained at temperature 90°C, drying 
time 2.5 or 5, and damp 5.5. The "optimal" mixture of "taste" and "colour-appeal" is 
obtained at the maximum value taken in Figure 4.b by the experimental tests which 
are beyond the UDL in Figure 4.a. The experimental test 40 could represent the 
"optimal" combination of parameters in terms of "taste" and "colour-appeal". 



6 Concluding remarks 

Today the advent of on-line process computer systems has totally changed the nature 
of the data that are available. The use of multivariate statistical methods is necessary 
to treat the problems associated with these large volumes of messy data. We can use 
all the information contained in the data to improve the quality of products and 
processes. Multivariate analyses such as Constrained Principal Component Analysis can 
be employed to determine the relationships between the quality characteristics of 
products and the process parameters. In this way, we can select the best technology 
to obtain a quality product and/or monitor the quality characteristics of the product 
through the process variables. 

In many situations it is reasonable to expect anomalous observations; in these 
cases the principal components are influenced and may not capture the variation of the 
regular observations. Therefore, data reduction based on PCA becomes unreliable. 
When outliers are present in the data, a method for robust principal component analysis 
can be used to obtain more accurate estimates for noncontaminated data sets and more 
robust estimates for contaminated data (Hubert et al., 2005). 



Abstract. A statistical process control (SPC) chart is aimed at monitoring a process over 
time in order to detect any special event that may occur and to find assignable causes for it. 
Controlling both product quality variables and process variables is a complex problem. 
Multivariate methods make it possible to treat all the data simultaneously, extracting 
information on the "directionality" of the process variation. Highlighting the dependence 
relationships between process variables and product quality variables, we propose the 
construction of a non-parametric chart based on Multivariate Additive Partial Least Squares 
Splines; proper control limits are built by applying the Bootstrap approach. 



1 Introduction 

The multivariate nature of product quality (response or output variables) and 
process characteristics (predictors or input variables) highlights the limits of any 
analysis based exclusively on descriptive and univariate statistics. On the other hand, 
the possibility for process managers to extract knowledge from large databases 
opens the way to analyzing the multivariate dependence relationships between product 
quality and process variables via predictive and regression techniques like PLS 
(Tenenhaus, 1998; Wold, 1966) and its generalizations (Durand, 2001; Lombardo et 
al., 2007). In this paper, the application of a multivariate control chart based on a 
generalization of the PLS-$T^2$ chart (Kourti and MacGregor, 1996) is proposed in order 
to analyze the in-control process and monitor it over time. Furthermore, in order 
to face the problem of the unknown distribution of the statistic to be charted, a 
non-parametric approach is applied for the selection of the control limits. 
Distribution-free or non-parametric control charts have been proposed in the literature 
to overcome the problems related to the lack of normality in process data. An overview 
of the literature on univariate non-parametric control charts is given by Chakraborti 
et al. (2001). The principles on which non-parametric control charts rest can be 
generalized to multivariate settings. In particular, the bootstrap approach to 
estimating control limits (Wu and Wang, 1997; Jones and Woodall, 1998; Liu and Tang, 
1996) has been followed. 



2 Multivariate control charts based on projection methods 

A standard multivariate quality control problem occurs when an observed vector of 
measurements on quality characteristics exhibits a significant shift from a set of 
target (or standard) values. The first attempt to face the problem of multivariate 
process control is due to Hotelling (1947), who introduced the well-known $T^2$ chart 
based on the variance-covariance matrix. Subsequently, different approaches that take 
into account the multivariate nature of the problem were proposed (Woodall and Ncube, 
1985; Lowry et al., 1992; Jackson, 1991; Liu, 1995; Kourti and MacGregor, 1996; 
MacGregor, 1997). In particular, we focus on the approach based on PLS components 
proposed by Kourti and MacGregor (1996), in order to monitor over time the 
dependence structure between a set of process variables and one or more product quality 
variables (Hawkins, 1991). The PLS approach proves to be effective in the presence of 
a low ratio of observations to variables and in case of multicollinearity among the 
predictors, but a major limit of this approach is that it assumes a linear dependence 
structure. Generally, the linearity assumption in a model is reasonable as a first 
research step, but in practice relationships between the process variables and the 
product quality variables are often non-linear, and in order to study the dependence 
structure it may be much more appropriate to use non-linear models (PLS via Spline, 
i.e. PLSS; Durand, 2001), as proposed by Vanacore and Lombardo (2005). The PLSS-$T^2$ 
chart can handle non-linear dependence relationships in the data structure, missing 
values and outliers, but it presents two major drawbacks: 1) it does not take into 
account the possible effect of interactions between process variables; 2) it requires 
testing the normality assumption on the component scores, even when the original data 
are multinormal (in fact, in the case of spline, i.e. non-linear, transformations of 
the original process variables, the multinormality assumption can no longer be 
guaranteed). To overcome these drawbacks we present the non-parametric Multivariate 
Additive PLS Spline-$T^2$ chart based on Multivariate Additive PLSS (MAPLSS, Lombardo et 
al., 2007), briefly described in sub-section 2.2. 

2.1 Review of MAPLSS 

MAPLSS is just the application of linear PLS regression of the response (matrix Y 
of dimension n, q) on linear combinations of the transformed predictors (matrix X 
of dimension n, p) and their interactions. The predictors and bivariate interactions 
are transformed via a set of $K = d + 1 + m$ (d is the spline degree and m is the knot 
number) basis functions, called B-splines $B_i(\cdot)$, so as to represent any spline as 
a linear combination 

$$s(x, \beta) = \sum_{i=1}^{K} \beta_i B_i(x),$$

where $\beta = (\beta_1, \dots, \beta_K)$ is the vector of spline coefficients computed 
via regression of $y \in \mathbb{R}^n$ on the $B_i(\cdot)$. The centered coding matrix, 
or design matrix, including interactions becomes 

$$B = [\,\dots B^i \dots \mid \dots B^{k,l} \dots\,]_{i \in K_1,\ (k,l) \in K_2}, \qquad (1)$$

where $K_1$ and $K_2$ are index sets for single variables and bivariate interactions, 
respectively. In a generic form, the MAPLSS model for the response j can be written 
as 

$$\hat{y}^j(A) = \sum_{l \in L} \beta^j_l(A)\, B^l, \qquad (2)$$

where A is the space dimension parameter and L is the index set pointing out the 
predictors as well as the bivariate interactions retained by MAPLSS. It is thus a 
purely additive model that depends on A, which in turn depends on the spline 
parameters (i.e. degree, number and location of knots). 

Increasing the order of interaction in MAPLSS implies expanding the dimension of 
the design matrix B. MAPLSS constructs a sequence of centered and uncorrelated 
predictors, i.e. the MAPLSS (latent) components $(t^1, \dots, t^A)$. We now briefly 
describe the MAPLSS model-building stage. In the first phase we do not consider 
interactions in the design matrix. This phase consists of the following steps 

step 1 Denote by $B_0 = B$ and $Y_0 = Y$ the design and response data matrices, 
respectively. Define $t^1 = B_0 w^1$ and $u^1 = Y_0 c^1$ as the first MAPLSS components, 
where the weighting unit vectors $w^1$ and $c^1$ are computed by maximizing the 
covariance between linear compromises of the transformed predictors and response 
variables, $\operatorname{cov}(t^1, u^1)$. 

step k Compute the generic MAPLSS component 

$$t^k = B_{k-1} w^k, \qquad u^k = Y_{k-1} c^k. \qquad (3)$$

Update the new matrices $B_k$ and $Y_k$ as the residuals of the least-squares 
regressions on the components previously computed, using the orthogonal projection 
operator $P_{t^k}$ on $t^k$, that is $P_{t^k} = t^k t^{k\prime} / \|t^k\|^2$; we write 

$$B_k = B_{k-1} - P_{t^k} B_{k-1}, \qquad (4)$$

$$Y_k = Y_{k-1} - P_{t^k} Y_{k-1}. \qquad (5)$$

Final Step The algorithm stops on the basis of the number A of components defined 
by the PRESS criterion. 
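
The component extraction and deflation in the steps above can be sketched as follows in Python with NumPy. The sketch takes the design matrix B as given (it does not build the spline coding), obtains the unit weight vectors that maximise the covariance as the dominant singular vectors of the cross-product matrix, and uses a fixed number of components A instead of the PRESS criterion; these choices, the function name and the simulated data are assumptions of the illustration.

```python
import numpy as np


def pls_components(B, Y, A):
    """Extract A components t^k = B_{k-1} w^k with deflation on t^k."""
    Bk = B - B.mean(axis=0)          # centered design matrix B_0
    Yk = Y - Y.mean(axis=0)          # centered response matrix Y_0
    n = B.shape[0]
    T = np.empty((n, A))
    U = np.empty((n, A))
    for a in range(A):
        # Unit vectors w, c maximising cov(Bk w, Yk c): dominant singular pair.
        left, s, right_t = np.linalg.svd(Bk.T @ Yk, full_matrices=False)
        w, c = left[:, 0], right_t[0]
        t = Bk @ w
        u = Yk @ c
        # Orthogonal projector on t, then residual (deflated) matrices.
        P_t = np.outer(t, t) / (t @ t)
        Bk = Bk - P_t @ Bk
        Yk = Yk - P_t @ Yk
        T[:, a], U[:, a] = t, u
    return T, U


rng = np.random.default_rng(1)
B = rng.normal(size=(100, 8))
Y = (B[:, [0]] - 2.0 * B[:, [3]]) + 0.1 * rng.normal(size=(100, 1))
T, U = pls_components(B, Y, A=3)
gram = T.T @ T
print(np.round(np.diag(gram), 2))
```

The off-diagonal entries of `gram` vanish up to rounding error, which is the numerical counterpart of the statement that the components are centered and uncorrelated.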

In the second phase of the MAPLSS model-building stage, we individually evaluate 
all possible interactions. The rule for accepting a candidate bivariate interaction is 
based on the gain in fit ($R^2$) and prediction (GCV criterion) compared to that of the 
model with main effects only. Then, the selected interactions are ordered by 
decreasing value so that they can be considered for addition, step by step, to the 
main effects model. At the end, in the final phase, we include the selected 
interactions in the design matrix B and repeat the algorithm from step 1 to the final 
step. 

A simple way to illustrate the contribution of predictors to response variables 
consists of ordering the predictors with respect to their decreasing influence on the 
response $\hat{y}^j(A)$, using as a criterion the range of the $s_i(x^i, \beta^j_i(A))$ 
values of the transformed sample $x^i$ (see figure 3). One can also use the same 
criterion to prune the model, by eliminating the predictors and/or the interactions of 
low influence so as to obtain a more parsimonious model. 



2.2 MAPLSS-$T^2$ chart 



Based on a generalization of the PLS chart that takes into account not only the 
original process variables but also their bivariate interactions, in this paper we 
discuss the applicability of a new chart called the MAPLSS-$T^2$ chart. Following the 
procedure used for the construction of multivariate control charts based on projection 
methods like the PCA-$T^2$ chart (Jackson, 1991), the PLS-$T^2$ chart (Kourti and 
MacGregor, 1996) and the PLSS-$T^2$ chart (Vanacore and Lombardo, 2005), the 
MAPLSS-$T^2$ chart is based on the first A components. The MAPLSS-$T^2$ chart is an 
effective monitoring tool: it incorporates the variability structure underlying process 
data and product quality data, extracting information on the directionality of the 
process variation. The scores of each new observation are monitored by the MAPLSS-$T^2$ 
control chart based on the following statistic 

$$T_A^2 = \sum_{a=1}^{A} \frac{t_a^2}{\lambda_a}, \qquad (6)$$

where $\lambda_a$ and $t_a$, for $a = 1, \dots, A$, are the eigenvalues and the component 
scores, respectively, of the previously defined covariance matrix. The control limits 
of the MAPLSS-$T^2$ chart are based on the percentiles $q_\alpha$ (for $\alpha < 10\%$) of 
the empirical distribution, $F_N$, of the MAPLSS component scores, computed on a large 
number N of bootstrap samples 

$$\alpha = P(T_A^2 \le q_\alpha \mid F_N). \qquad (7)$$
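
The statistic and its bootstrap percentile limits can be sketched as follows in Python with NumPy. The statistic is the usual sum of squared scores scaled by their variances. As a simplification that is an assumption of this sketch, the bootstrap resamples the rows of an already computed score matrix rather than refitting the regression on every bootstrap sample; function names and simulated scores are hypothetical.

```python
import numpy as np


def t2_statistic(T, lam):
    """Sum over components of squared scores divided by their variances."""
    return np.sum(T ** 2 / lam, axis=1)


def bootstrap_limits(T, n_boot=500, alpha=0.01, seed=0):
    """Percentile limits of the pooled bootstrap distribution of T^2."""
    rng = np.random.default_rng(seed)
    n = T.shape[0]
    pooled = []
    for _ in range(n_boot):
        Tb = T[rng.integers(0, n, size=n)]   # resample rows with replacement
        Tb = Tb - Tb.mean(axis=0)
        lam_b = Tb.var(axis=0, ddof=1)
        pooled.append(t2_statistic(Tb, lam_b))
    pooled = np.concatenate(pooled)
    return np.quantile(pooled, alpha), np.quantile(pooled, 1 - alpha)


# Hypothetical score matrix: n = 100 observations, A = 3 components.
rng = np.random.default_rng(2)
T = rng.normal(size=(100, 3)) * np.array([3.0, 1.5, 0.7])
T = T - T.mean(axis=0)
lam = T.var(axis=0, ddof=1)

t2 = t2_statistic(T, lam)
lcl, ucl = bootstrap_limits(T, n_boot=500, alpha=0.01)
out_of_control = np.flatnonzero((t2 < lcl) | (t2 > ucl))
print(lcl, ucl, out_of_control)
```

Observations listed in `out_of_control` are the ones to examine with diagnostic tools such as contribution plots.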



Multivariate control charts can detect an unusual event but do not provide a reason 
for it. Following the diagnostic approach proposed by Kourti and MacGregor (1996) 
and using some new tools, we can investigate observations falling outside the limits 
through 

(1) bar plots of standardized out-of-control scores ($t_a/\sqrt{\lambda_a}$ for 
$a = 1, \dots, A$), to focus on the most important dimensions; 

(2) bar plot of the contributions of the process variables on the dimensions identified 
as the most important ones, to evaluate how each process variable involved in the 
calculation of that score contributes to it; 

(3) bar plot of the contributions of the process variables on product variables 
(measured by the spline range), to evaluate the importance of process variables. 



3 Application: monitoring the painting process of hot-rolled 
aluminium foils 

In this section we illustrate the usefulness of the MAPLSS-$T^2$ chart and the related 
diagnosis tools by applying them to monitor a real manufacturing process. We focus 
on the modeling phase of statistical process control. The data refer to a 
manufacturing firm in Naples, specialized in hot-rolling of aluminium foils. The 
manufacturing process consists in simultaneously painting the lower and upper surfaces 
of an aluminium foil. The process starts by setting the aluminium roll on the 
unwinding swift. The aluminium foil, pulled by the draught rein that manages the 
crossing speed, reaches the painting station where it is uniformly painted on both 
surfaces by deflector rolls. The paint drying and polymerization take place in a 
flotation oven consisting of 6 distinct modules (each module is characterized by a 
specific temperature and can be gradually boosted and independently tuned). 

The process stops by rewinding the aluminium roll. The key product quality 
characteristics are the uniformity and stability of the aluminium painting. Both of 
them depend on the Peak Metal Temperature, PMT, reached during the polymerization. 
By managing the temperatures during the stay of the aluminium foil in the oven, one 
can influence the PMT. Thus PMT has been selected as the only product quality 
variable, whereas the temperatures characterizing the six modules (T1, T2, T3, T4, T5, T6) 
and the post-combustion temperature (Tpost) have been selected as process 
variables. The MAPLSS-$T^2$ control chart is built on a historical data set of n = 100 
independent unit samples. The computational strategy consists in first performing 
the MAPLSS regression (see Table 1) using a low degree and knot number (degree=1, 
knots=1), and deciding the dimension space A by Cross Validation (we get A = 3 with 
PRESS = 0.15). Using the balance between goodness of fit ($R^2$) and parsimony 
(PRESS), we select only one interaction among the candidates; the resulting best one 
is T4*T5. Afterwards we extract N = 500 Bootstrap samples and perform the MAPLSS 




Fig. 1. MAPLSS-$T^2$ control chart. 




Rosaria Lombardo, Amalia Vanacore and Jean-François Durand 




Fig. 2. Bar plot of contributions of process variables to the second dimension 




Fig. 3. Bar plot of contributions of process variables to PMT . 



regression procedure on each of them, having properly hxed the model parameters 
(degree=l, knots=l, A=3). The computation of the scores for all Bootstrap sam- 
ples allows to estimate the empirical distribution function of T^. We fix the con- 
trol chart upper and lower limits at the percentiles with a = 1% and a = 99% 
(UCL=393.03, LCL=2.81) 
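The percentile rule for the control limits can be sketched as follows. This is a minimal sketch, assuming the T² values computed on the Bootstrap samples are already available as an array; the function name is hypothetical, and the chi-square draws only stand in for those values (they are not the data of the paper).

```python
import numpy as np

def percentile_control_limits(t2_values, lower=1.0, upper=99.0):
    """Return (LCL, UCL) as empirical percentiles of bootstrap T^2 values."""
    t2 = np.asarray(t2_values, dtype=float)
    lcl, ucl = np.percentile(t2, [lower, upper])
    return lcl, ucl

# Stand-in data (assumption): 500 bootstrap samples of 100 units, A = 3 scores.
rng = np.random.default_rng(0)
t2_boot = rng.chisquare(df=3, size=(500, 100)).ravel()
lcl, ucl = percentile_control_limits(t2_boot)
```

Observations whose T² falls outside the interval [LCL, UCL] would then be flagged for diagnosis.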

Table 1. MAPLSS results: R² according to the dimension.

Dimension A    R²      % cum.
1              0.74    74
2              0.16    89.6
3              0.03    92.3

Looking at the resulting control chart (see Fig. 1) we note two out-of-control points
at the beginning of the sequence (points 5 and 13). They must be investigated using
bar plots: for points 5 and 13, dimension 2 turns out to be the most important one
for both out-of-control points. The bar plot of the process variables which contribute
to dimension 2 (Fig. 2) highlights that the most important process variables are the
temperature in zone 4 (T4), zone 3 (T3), zone 2 (T2), zone 1 (T1), the interaction
between temperatures in zone 4 and zone 5 (T4*T5), etc. In particular, T4 has a
strong effect on dimension 2 as well as on the product quality variable (Fig. 3). In
Fig. 3 we read in decreasing order the most important predictors of PMT: apart
from T4, the other important process variables are T1, T2, T6, T4*T5, and so on.
It is interesting to observe that the interaction T4*T5 is more important
than the simple process variable T5 (Fig. 2 and 3). After the diagnosis
analysis, the causes of the observed out-of-control points have been detected. In fact
the expert technologist suggests that the out-of-control signals are the consequence
of a ‘transition phenomenon’ due to a calibration problem in the feedback of the
automatic loop (i.e. the methane valve opens when temperature is naturally rising).
Having identified and removed the causes of the out-of-control signals, the modeling
phase of the MAPLSS-T² chart requires that the control limits be recomputed
excluding the out-of-control points. The modeling phase ends when all points are
inside the control limits.



4 Conclusion 

In this paper a non-parametric multivariate process control chart has been
proposed for monitoring a manufacturing process. By simultaneously monitoring
process and product variables, the MAPLSS-T² chart quickly detects and diagnoses
unusual events that may occur during the process. The proposed non-parametric control
chart can handle collinear variables, missing values, outliers and interactions
between variables, without imposing any distributional assumption. Further
developments of this work could be related to the construction of a chart of the Squared
Prediction Error (SPE; Kourti and MacGregor, 1996) on the MAPLSS model, in order to
monitor any change in the covariance structure and verify that the process conditions
during the monitoring stage do not differ from those at the time the in-control
MAPLSS model was developed.

Abstract. Simple Component Analysis (SCA) was introduced by Rousson and Gasser (2004)
as an alternative to Principal Component Analysis (PCA). The goal of SCA is to find the
“optimal simple system” of components for a given data set, which may be slightly correlated
and suboptimal compared to PCA but which is easier to interpret.

The aim of the present paper is to consider an extension of SCA to categorical data.
In particular, we consider a simple version of the Non Symmetrical Correspondence
Analysis (D’Ambra and Lauro, 1989). This latter approach can be seen as a centered PCA on the
column profile matrix with suitable metrics, which makes it possible to describe the association
in a two-way contingency table in cases where one categorical variable is supposed to be the
explanatory variable and the other the response.



1 Introduction 

It is well known that Principal Component Analysis (PCA) is optimal in at least 
two ways: principal components extract a maximum of the variability of the original 
variables and they are uncorrelated. The former ensures that a minimum of “total 
information” will be missed when looking at the first few principal components. The 
latter warrants that the extracted information will be organized in an optimal way: 
we may look at one principal component after the other, separately, without taking 
into account the rest. 

Unfortunately, principal components often lack interpretability. They define some 
abstract scores which often are not meaningful, or not well interpretable in practice. 
The same remark applies to all methods based on PCA. 

Simple Component Analysis (SCA) was introduced by Rousson and Gasser
(2004) as an alternative to Principal Component Analysis. The goal of SCA was to
find the “optimal simple system” of components for a given data set. A component
was considered to be simple if the number of possible values for its loadings was
restricted to three (a positive one, zero and a negative one). Optimality of a system of
components was defined as in Gervini and Rousson (2004). In the end, the optimal
simple system defined by SCA may be slightly correlated and suboptimal compared
to PCA but will be easier to interpret. Thus, SCA may represent a worthwhile alternative
to PCA if the loss of optimality remains modest.

The aim of the present paper is to consider an extension of SCA to categorical data.
In particular, we consider a simple version of the Non Symmetrical Correspondence
Analysis (D’Ambra and Lauro, 1989). This latter approach can be seen as a PCA
performed on the column profile matrix with the same weighting system as
Correspondence Analysis but in a different metric.

Advantages of the method are illustrated with a well known data set. 



2 Non symmetrical correspondence analysis 

In many fields, the researcher is interested in studying the relationship between two or
more variables. When the variables are collected in a contingency table, classical
statistical tools like correspondence analysis (CA) are applied in order to measure and
visualize the strength of the association.

CA is based on the decomposition of Pearson's index $\phi^2$, which is a
symmetric measure of association. This approach however is no longer appropriate
when one has to study a two-way contingency table where one categorical variable is
supposed to be the explanatory variable and the other the response. To overcome this
problem, D’Ambra and Lauro (1989) introduced the Non Symmetrical Correspondence
Analysis (NSCA). This approach decomposes the numerator of the Goodman-Kruskal
$\tau$ (1954), which is an asymmetric measure of association in a contingency
table.

Given two categorical variables $I$ and $J$, the goal of NSCA is to evaluate the
influence of the categories of the explanatory variable $J$ on the distribution of the
response $I$.

Let $N = (n_{ij})$ and $P = N/n = (p_{ij}) = (n_{ij}/n)$ be the absolute and relative two-way
contingency tables of dimension $I \times J$, where $I$ and $J$ also denote the number of
categories of the response and the explanatory variable, respectively, based on $n$
individuals. Let $p_{i.} = \sum_{j=1}^{J} p_{ij}$ and $p_{.j} = \sum_{i=1}^{I} p_{ij}$ be the row and column
marginals, respectively, and let $D_J = \mathrm{diag}(p_{.j})$.

Finally, let

\[ \Pi = (\pi_{ij}) = \left( \frac{p_{ij}}{p_{.j}} - p_{i.} \right) \]

be the matrix describing the conditional distribution of $I$ given $J$. This matrix contains
information on the $J$ conditional distributions $p_{ij}/p_{.j}$ adjusted to the row marginal
$p_{i.}$, which is a weighted average of the column profiles.

From a geometrical point of view, the purpose of NSCA is to evaluate the spread of
the cloud of points defined by $\Pi$ around its centroid according to an
appropriate weighting system. A global measure of dispersion is given by the inertia
of this cloud.

NSCA looks for the orthonormal basis which accounts for the largest part of the
inertia, in order to visualize the dependence structure between $J$ and $I$ in a lower
dimensional space. Solutions are given by the eigen-analysis of the variance-covariance
matrix $S = \Pi D_J \Pi'$, whose general term $(i, i')$ is given by

\[ s_{ii'} = \sum_{j=1}^{J} p_{.j} \left( \frac{p_{ij}}{p_{.j}} - p_{i.} \right) \left( \frac{p_{i'j}}{p_{.j}} - p_{i'.} \right), \]

where $p_{i.}$ denotes the centre of gravity of the $i$th row of $P$. This is also achieved by
the generalized singular value decomposition $\Pi = \sum_{m=1}^{M^*} \lambda_m a_m b_m'$, with
$M^* \le M = \min(I, J) - 1$, where the scalar $\lambda_m$ is the $m$th singular value (we shall
write $\Lambda = \mathrm{diag}(\lambda_m)$), and $a_m$ and $b_m$ are orthonormal singular vectors in an
unweighted and a weighted metric, respectively, such that $a_m' a_m = 1$, $a_m' a_{m'} = 0$ and
$b_m' D_J b_m = 1$, $b_m' D_J b_{m'} = 0$ for $m \ne m'$.

With this decomposition, the numerator of the Goodman and Kruskal $\tau$
(1954) can be decomposed as $\tau_{num} = \sum_{m=1}^{M^*} \lambda_m^2$.

The factorial row and column coordinates are denoted by $\psi_m$ and $\varphi_m$,
respectively; they can also be obtained from the transition formulae.
See D’Ambra and Lauro (1989) for further details and remarks.
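The decomposition above can be sketched numerically. The following is a minimal sketch, assuming NumPy and a small made-up contingency table; the function name `nsca` is hypothetical. The generalized SVD is obtained from an ordinary SVD of $\Pi D_J^{1/2}$, and the Goodman-Kruskal $\tau$ uses its standard denominator $1 - \sum_i p_{i.}^2$, which is not spelled out in the text.

```python
import numpy as np

def nsca(N):
    """NSCA of a contingency table N (rows = response, columns = predictor)."""
    P = N / N.sum()
    p_row = P.sum(axis=1)                  # p_i.
    p_col = P.sum(axis=0)                  # p_.j
    Pi = P / p_col - p_row[:, None]        # pi_ij = p_ij / p_.j - p_i.
    # Generalized SVD (identity row metric, D_J column weights):
    # ordinary SVD of Pi D_J^{1/2}, then rescale the right singular vectors.
    U, lam, Vt = np.linalg.svd(Pi * np.sqrt(p_col), full_matrices=False)
    A = U                                  # a_m' a_m = 1
    B = Vt.T / np.sqrt(p_col)[:, None]     # b_m' D_J b_m = 1
    tau_num = np.sum(lam ** 2)             # numerator of tau = total inertia
    tau = tau_num / (1.0 - np.sum(p_row ** 2))
    return Pi, lam, A, B, tau_num, tau

# Made-up 4 x 3 table, for illustration only.
N = np.array([[20, 5, 3], [4, 18, 6], [3, 5, 22], [2, 3, 5]])
Pi, lam, A, B, tau_num, tau = nsca(N)
```

The ratios `lam**2 / tau_num` give the share of the dependence accounted for by each axis.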



3 Simple non symmetrical correspondence analysis 

It is possible to show that NSCA corresponds to a PCA of the profile matrix $\Pi$ with
suitable row and column metrics. This is equivalent (Tenenhaus and Young, 1985) to
studying the statistical triplet $(\Pi, I, D_J)$, where the identity matrix $I$ denotes the
metric and $D_J$ the weighting system. Thus, like all PCA-based methods, the components
produced by NSCA are optimal but may lack interpretability, as recalled in the
Introduction.

In the same way as SCA was introduced as an alternative to PCA, we shall now
introduce a technique called Simple NSCA as an alternative to NSCA. For this, we
shall use similar concepts and algorithms as in SCA. Note that while one makes the
distinction between simple block-components and simple difference-components in
SCA, we shall here consider only difference components (i.e. components with both
positive and negative loadings), since NSCA does not produce block-components
(i.e. components where all loadings share the same sign). Thus, we shall consider
simple components with loadings proportional to vectors with only three different
values (a negative value, zero, and a positive value), the sum of the loadings being
zero for each component (hence defining proper contrasts of categories).

The goal of Simple NSCA is to find the optimal system of components among
the simple ones, where optimality is calculated according to Gervini and Rousson
(2004).

The percentage of extracted variability $V(L)$ accounted for by a system $L$ of
$m = \min(I, J) - 1$ components is given by

\[ V(L) = \frac{l_1' S\, l_1}{\mathrm{tr}(\Lambda^2)} + \sum_{k=2}^{m} \frac{l_k' S\, l_k - l_k' S L_{(k-1)} \left( L_{(k-1)}' S L_{(k-1)} \right)^{-1} L_{(k-1)}' S\, l_k}{\mathrm{tr}(\Lambda^2)}, \]

where $l_k$ is the $k$th column of $L$, and where $L_{(k-1)}$ is the $m \times (k-1)$ matrix containing
the first $(k-1)$ columns of $L$.

Whereas the numerator of the first term of this sum is equal to the variance of the
first component, the numerator of the $k$th term can be interpreted as the variance of
the part of the $k$th component which is not explained by (i.e. which is independent of)
the previous $(k-1)$ components. Thus, correlations are "penalized" by this criterion,
which is hence uniquely maximized by PCA, i.e. by taking $L = E_m$, the matrix of the
first $m$ eigenvectors of $S$ (Gervini and Rousson, 2004). The optimality of a system $L$
is then calculated as $V(L)/V(E_m)$.
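The criterion can be sketched in a few lines. This is a minimal sketch, assuming NumPy; the function names are hypothetical, and the total variability is computed as the trace of $S$, which equals the sum of the squared singular values since the eigenvalues of $S$ are the $\lambda_m^2$.

```python
import numpy as np

def extracted_variability(L, S):
    """V(L): variance of each component not explained by the previous ones."""
    total = np.trace(S)
    value = 0.0
    for k in range(L.shape[1]):
        l_k = L[:, k]
        var_k = l_k @ S @ l_k
        if k > 0:
            L_prev = L[:, :k]                      # first k columns
            c = L_prev.T @ S @ l_k
            # subtract the part explained by the previous components
            var_k -= c @ np.linalg.solve(L_prev.T @ S @ L_prev, c)
        value += var_k
    return value / total

def optimality(L, S):
    """Ratio V(L) / V(E_m), with E_m the first m eigenvectors of S."""
    m = L.shape[1]
    _, eigvecs = np.linalg.eigh(S)                 # ascending eigenvalues
    E_m = eigvecs[:, ::-1][:, :m]
    return extracted_variability(L, S) / extracted_variability(E_m, S)
```

For the eigenvectors themselves the cross terms vanish, so `extracted_variability` reduces to the usual cumulative proportion of variance.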

In our sequential algorithms below, the $k$th simple component is obtained by
regressing the original row/column categories on the previous $k-1$ simple components
already in the system, by computing the first eigenvector of the residual variance
hence obtained, and by shrinking this eigenvector towards the simple difference
component which maximizes optimality. Here are two algorithms providing simple
components for the rows and the columns.



Simple solutions for the rows

1. Let $S = \Pi D_J \Pi'$, let $L$ be an empty matrix and let $\tilde{S} = S$.

2. Let $a = (a_1, \dots, a_I)'$ be the first eigenvector of $\tilde{S}$.

3. For each cut-off value $g$ among $\{0, |a_1|, \dots, |a_I|\}$, consider the shrunken vector
$b(g) = (b_1(g), \dots, b_I(g))'$ with elements $b_k(g) = \mathrm{sign}(a_k)$ if $|a_k| > g$ and $b_k(g) = 0$
otherwise (for $k = 1, \dots, I$). Update and normalize it such that $\sum_k b_k(g) = 0$ and
$\sum_k b_k^2(g) = 1$.

4. Include into the system the difference component $b(g)$ which maximizes
$b(g)' \tilde{S} b(g)$ (i.e. add the column $b(g)$ to the matrix of loadings $L$).

5. If the maximum number of components is attained, stop. Otherwise let
$\tilde{S} = S - SL(L'SL)^{-1}L'S$ and go back to step 2.

Simple solutions for the columns

1. Let $S = D_J^{1/2} \Pi' \Pi D_J^{1/2}$, let $L$ be an empty matrix and let $\tilde{S} = S$.

2. Let $a = (a_1, \dots, a_J)'$ be the first eigenvector of $\tilde{S}$.

3. For each cut-off value $g$ among $\{0, |a_1|, \dots, |a_J|\}$, consider the shrunken vector
$b(g) = (b_1(g), \dots, b_J(g))'$ with elements $b_k(g) = \mathrm{sign}(a_k)$ if $|a_k| > g$ and $b_k(g) = 0$
otherwise (for $k = 1, \dots, J$). Update and normalize it such that $\sum_k b_k(g) = 0$ and
$\sum_k b_k^2(g) = 1$.

4. Include into the system the difference component $b(g)$ which maximizes
$b(g)' \tilde{S} b(g)$ (i.e. add the column $b(g)$ to the matrix of loadings $L$).

5. If the maximum number of components is attained, stop.
Otherwise let $\tilde{S} = S - SL(L'SL)^{-1}L'S$ and go back to step 2.
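The row algorithm above can be sketched as follows. This is a minimal sketch, assuming NumPy and a made-up contingency table; the function names are hypothetical. The "update" in step 3 is read here as giving all positive loadings one common value and all negative loadings another, so that they sum to zero, which is an assumption about the intended rescaling.

```python
import numpy as np

def nsca_row_matrix(N):
    """S = Pi D_J Pi' for a contingency table N (rows = response)."""
    P = N / N.sum()
    p_row, p_col = P.sum(axis=1), P.sum(axis=0)
    Pi = P / p_col - p_row[:, None]
    return Pi @ np.diag(p_col) @ Pi.T

def simple_components(S, n_components):
    """Sequential shrinkage of eigenvectors towards simple difference components."""
    L = np.empty((S.shape[0], 0))
    S_res = S.copy()
    for _ in range(n_components):
        _, eigvecs = np.linalg.eigh(S_res)
        a = eigvecs[:, -1]                       # first eigenvector
        best, best_val = None, -np.inf
        for g in np.concatenate(([0.0], np.abs(a))):
            b = np.where(np.abs(a) > g, np.sign(a), 0.0)
            pos, neg = b > 0, b < 0
            if not pos.any() or not neg.any():
                continue                         # not a difference component
            b[pos] = 1.0 / pos.sum()             # loadings sum to zero ...
            b[neg] = -1.0 / neg.sum()
            b /= np.linalg.norm(b)               # ... and have unit norm
            val = b @ S_res @ b
            if val > best_val:
                best, best_val = b, val
        if best is None:
            raise ValueError("no simple difference component found")
        L = np.column_stack([L, best])
        SL = S @ L                               # deflation: S - SL(L'SL)^{-1}L'S
        S_res = S - SL @ np.linalg.solve(L.T @ SL, SL.T)
    return L

# Made-up 4 x 4 table, for illustration only.
N = np.array([[20, 5, 3, 2], [4, 18, 6, 2], [3, 5, 22, 4], [2, 3, 5, 16]])
L = simple_components(nsca_row_matrix(N), n_components=2)
```

The column algorithm would follow the same loop with $S = D_J^{1/2} \Pi' \Pi D_J^{1/2}$ as input.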

4 Father’s and son’s occupations data 

To illustrate the technique of Simple NSCA, we applied it to the well-known Father’s
and Son’s Occupations data. This data set (Perrin, 1904) was collected to study whether
and how the professional occupation of a man depends on the occupation of his
father. A total of 1550 men were cross-classified according to father’s and son’s
occupation, each divided into 14 categories.

The conclusion of the study was that such a dependence existed. Two measures
of predictability, the Goodman-Kruskal $\tau$ (1954) and the Light and Margolin
$C = (n-1)(I-1)\tau$ (1971), have been computed. Note that the $C$-statistic can be used
to formally test for association, being asymptotically chi-squared distributed with
$(I-1)(J-1)$ degrees of freedom under the hypothesis of no association (Light and
Margolin, 1971).

The overall increase in predictability of a man’s occupation when knowing the
occupation of his father was equal to 14% ($\tau = 0.14$; $C = 2880.8$; df = 169,
$p < 0.0001$).
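The reported figures can be cross-checked with a few lines of arithmetic. This is only a consistency check on the numbers quoted above (n = 1550 men, 14 categories on each side, C = 2880.8), not a recomputation from the data.

```python
# Values quoted in the text
n, I, J = 1550, 14, 14
C = 2880.8

# C = (n - 1)(I - 1) tau  =>  tau = C / ((n - 1)(I - 1))
tau = C / ((n - 1) * (I - 1))

# Degrees of freedom of the asymptotic chi-squared distribution
df = (I - 1) * (J - 1)
```

Both quantities agree with the reported values of 0.14 and 169.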

According to the NSCA decomposition of the numerator of $\tau$ ($\tau_{num} = \sum_k \lambda_k^2 =
0.1288$), we have for the first two axes $\lambda_1 = 0.24$ and $\lambda_2 = 0.16$ (rounded), which are
the weights of the axes in the joint plot of Figure 1. The first axis accounts for
$100 \times \lambda_1^2 / 0.1288 = 43.7\%$ of the dependence between the two variables while the
second one represents 20.7%. Therefore Figure 1 accounts for 64.4% of the total
inertia.

Unfortunately, the two-dimensional NSCA solution (Figure 1) does not give a
clear description of the dependence between the two variables or of the association
between rows and columns. Thus, NSCA is difficult to interpret and a simple solution
has been calculated according to Simple NSCA.

From Table 1, one can see that the first component defined by Simple NSCA for
the rows contrasts son’s occupation “Art” versus the group of occupations {Army,
Divinity, Law, Medicine, Politics & Court and Scholarship & Science}. This simple
component explains 42.5% of the variance compared to 43.7% for the optimal solution
above. Thus, the first simple row solution is 42.5%/43.7% = 97.4% optimal. One can
conclude that the influence of father’s occupation on son’s occupation mainly
contrasts these two groups of occupations. The second simple row solution provided by
Simple NSCA contrasts son’s occupation “Divinity” versus the group of occupations
{Army and Politics & Court}.

Fig. 1. Non Symmetrical Correspondence Analysis (NSCA): Joint plot.

The same table also contains the Simple NSCA solution for the columns. The
first simple column solution contrasts father’s occupation “Art” versus “Divinity”,
and is 81.9% optimal. The second simple column solution contrasts the group of father’s
occupations {Army, Landownership, Law and Politics & Court} versus {Art and
Divinity}, with an optimality value of 90.4%. Similarly, further simple contrasts can
be defined for both the rows and the columns (see Table 1 for the first 5 solutions).

Table 1. Simple NSCA solutions for the first five axes.

                               SON (row)                         FATHER (column)
                   Axis1  Axis2  Axis3  Axis4  Axis5    Axis1  Axis2  Axis3  Axis4  Axis5
Army                0.15  -0.41  -0.44  -0.37  -0.50     0.00  -0.89  -1.20   3.21   0.00
Art                -0.93   0.00   0.00   0.00   0.00    -2.04   1.77  -1.20   0.00   0.00
TCCS                0.00   0.00   0.00   0.00   0.00     0.00   0.00   0.00   0.00   0.00
Crafts              0.00   0.00   0.00   0.00   0.00     0.00   0.00   0.86   0.00   0.00
Divinity            0.15   0.82  -0.44   0.00   0.00     2.04   1.77  -1.20   0.00   0.00
Agriculture         0.00   0.00   0.00   0.00   0.00     0.00   0.00   0.86   0.00   0.00
Landownership       0.00   0.00   0.00   0.00   0.00     0.00  -0.89  -1.20   0.00   0.00
Law                 0.15   0.00   0.33   0.55  -0.50     0.00  -0.89   0.86  -1.61  -2.65
Literature          0.00   0.00   0.33   0.00   0.00     0.00   0.00   0.86   0.00   0.00
Commerce            0.00   0.00   0.33   0.00   0.00     0.00   0.00   0.86   0.00   0.00
Medicine            0.15   0.00   0.00  -0.37   0.50     0.00   0.00   0.86   0.00   2.65
Navy                0.00   0.00   0.00   0.00   0.00     0.00   0.00   0.00   0.00   0.00
POLCOURT            0.15  -0.41  -0.44   0.55   0.50     0.00  -0.89  -1.20  -1.61   0.00
SCSCIENCE           0.15   0.00   0.33  -0.37   0.00     0.00   0.00   0.86   0.00   0.00

Explained variance (%)
Optimal solution   43.70  64.40  75.30  83.00  89.20    43.70  64.40  75.30  83.00  89.20
Simple solution    42.50  62.20  72.30  79.70  85.70    35.80  58.20  68.50  75.10  80.30
Optimality         97.40  96.60  96.10  96.10  96.10    81.90  90.40  91.00  90.50  90.00

Note: TCCS, POLCOURT and SCSCIENCE stand for “Teacher, Clerk and Civil
Servant”, “Politics & Court” and “Scholarship & Science”, respectively.

To better summarize and visualize the relationship between father’s and son’s
occupation, it is helpful to plot the solutions for rows and columns for each axis on
the same graph (Figure 2). One can see that the first Simple NSCA solution highlights
the fact that a son tends to choose the same occupation as his father if
this occupation is “Art”, while father’s occupation “Divinity” is linked with a son’s
occupation within {Army, Divinity, Law, Medicine, Politics & Court and Scholarship
& Science}. Similarly, one can try to interpret the second Simple NSCA solution.

In summary, Simple NSCA provides a clear-cut picture of the situation, the
optimality of the first two axes being in this example more than 95% (for the rows)
and 90% (for the columns). Thus, the price to pay for simplicity is about 5% (for the
rows) and 10% (for the columns), which is not much. In this sense, Simple NSCA
may be a worthwhile alternative to NSCA.



5 Conclusions 



In general, all PCA-based methods are tuned to condense information in an optimal
way. However, they define some abstract scores which often are not meaningful or
not well interpretable in practice. This was also the case in our example above for
NSCA. To enhance interpretability, Simple NSCA focuses on simplicity and seeks
“optimal simple components”, as illustrated in our example. It provides a clear-cut
interpretation of the association between rows and columns, the price to pay
for simplicity being relatively low. In this sense, Simple NSCA may be a worthwhile
alternative to NSCA. Extensions of this approach to classical Correspondence
Analysis and to ordinal variables are under investigation.

Fig. 2. Summary of Simple NSCA solutions for the axes 1 and 2.

Abstract. The aim of this work is to present a method of joint factorial analysis of several
contingency tables. This method, which we have called Simultaneous Analysis (SA), is especially
appropriate for analyzing frequency tables whose row margins are different, for example when
the tables are from different samples or different time points. Furthermore, SA may be applied
to the joint analysis of more than two data tables in which rows refer to the same entities, but
columns may be different.

SA allows us to maintain the structure of each table in the overall analysis by centering
each table internally with its margins, as is done in Correspondence Analysis (CA), and
provides a joint description of the different structures contained within each table. Besides jointly
studying the intrastructure of the tables, SA permits an overall comparison of the similarities
and differences between the tables.



1 Introduction 

The need to jointly analyze several contingency tables has produced several
factorial methods.

Some of the proposed methods consist in the analysis of the table obtained as the
sum of the separate contingency tables and/or the analysis of the table obtained as the
juxtaposition of the initial tables (Cazes (1980) and (1981)) and the Intra Analysis
(Escofier (1983)). Nevertheless, in Zarraga and Goitisolo (2002) it is shown that there
are situations where none of these methods permits an analysis of the similarities
among rows that maintains the similarity found in the analyses of the separate tables.

The aim of this work is to present a factorial method for the joint analysis of sev- 
eral contingency tables that allows, in a similar way to correspondence analysis, the 
study of the similarity among the set of rows, of columns and the relations between 
both sets. 

We also cite the non-symmetrical analysis (D’Ambra and Lauro (1984) and Lauro
and D’Ambra (1989)) and, more recently, the Multiple Factor Analysis for
Contingency Tables (Pages and Becue-Bertaut (2006)).

2 Methodology

Let $T = \{1, \dots, t, \dots, T\}$ be the set of contingency tables to be analyzed. Each of
them classifies the answers of $n_{..t}$ individuals with respect to two categorical
variables. All the tables have one of the variables in common, in this case the row
variable with categories $I = \{1, \dots, i, \dots, I\}$. The other variable of each contingency table
can be different or the same variable observed at different time points or in different
subsamples. On concatenating all these contingency tables, a joint set of columns
$J = \{1, \dots, j, \dots, J\}$ is obtained. The element $n_{ijt}$ corresponds to the total number of
individuals who choose simultaneously the categories $i \in I$ of the first variable and
$j \in J_t$ of the second variable, for table $t \in T$. Sums are denoted in the usual way, for
example, $n_{i.t} = \sum_{j \in J_t} n_{ijt}$, and $n$ denotes the grand total of all $T$ tables.

In order to maintain the internal structure of each table $t$, SA begins by obtaining
the relative frequencies of each table as usually done in CA: $p^t_{ij} = n_{ijt}/n_{..t}$, so that
$\sum_{i \in I} \sum_{j \in J_t} p^t_{ij} = 1$ for each table $t$. It is important to keep in mind that these relative
frequencies are different from those obtained when calculating the relative frequency
for the whole matrix: $p_{ijt} = n_{ijt}/n$.
The method that we propose is carried out in three stages. 

2.1 Stage one: CA of each contingency table 

Since in SA it is important for each table to maintain its own structure, the first 
stage carries out a classical CA of each of the T contingency tables. These separate 
analyses also allow us to check for the existence of structures common to the different 
tables. From these analyses it is possible to obtain the weighting used in the next 
stage. 

CA on the $t$-th contingency table can be carried out by calculating the singular
value decomposition (SVD) of the matrix $X^t$, whose general term is:

\[ x^t_{ij} = \frac{p^t_{ij} - p^t_{i.}\, p^t_{.j}}{\sqrt{p^t_{i.}\, p^t_{.j}}}. \]

Let $D^t_I$ and $D^t_J$ be the diagonal matrices whose diagonal entries are respectively the
marginal row frequencies $p^t_{i.}$ and column frequencies $p^t_{.j}$. From the SVD of each
table $X^t$ we retain the first squared singular value (or eigenvalue, or principal inertia),
denoted by $\lambda^t_1$.
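Stage one can be sketched for a single table as follows. This is a minimal sketch, assuming NumPy, a made-up table and the usual CA matrix of standardized residuals; the function names are hypothetical.

```python
import numpy as np

def ca_matrix(N_t):
    """Standardized residual matrix X^t of one table, with its row/column masses."""
    P = N_t / N_t.sum()
    r = P.sum(axis=1)                      # row masses p^t_i.
    c = P.sum(axis=0)                      # column masses p^t_.j
    X = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return X, r, c

def first_principal_inertia(N_t):
    """First squared singular value of X^t (lambda^t_1)."""
    X, _, _ = ca_matrix(N_t)
    return np.linalg.svd(X, compute_uv=False)[0] ** 2

# Made-up 4 x 3 table, for illustration only.
N = np.array([[20.0, 5, 3], [4, 18, 6], [3, 5, 22], [2, 3, 5]])
lam1 = first_principal_inertia(N)
```

The sum of all squared singular values of $X^t$ is the total inertia of the table, i.e. its chi-squared statistic divided by the table total.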

2.2 Stage two: analysis of intrastructure

In the second stage, in order to balance the influence of each table in the joint
analysis, measured by the inertia, and to prevent this joint analysis from being dominated
by a particular table, SA includes a weighting $\alpha_t$ on each table. With this aim, in
SA, $\alpha_t = 1/\lambda^t_1$, where $\lambda^t_1$ denotes the first eigenvalue (square of the first singular value)
of the separate CA of table $t$ (stage one). This weight is similar to the one used in
Multiple Factor Analysis (MFA) (Escofier and Pages (1988)).




Factorial Analysis of a Set of Contingency Tables 221 



As a result, SA proceeds by performing a principal component analysis (PCA)
of the matrix

\[ X = \left[ \sqrt{\alpha_1}\, X^1 \; \dots \; \sqrt{\alpha_t}\, X^t \; \dots \; \sqrt{\alpha_T}\, X^T \right]. \]

The PCA results are also obtained using the SVD of $X$, giving singular values
$\sqrt{\lambda_s}$ on the $s$-th dimension and corresponding left and right singular vectors $u_s$ and
$v_s$.

We calculate projections on the $s$-th axis of the columns as principal coordinates $g_s$:

\[ g_s = \sqrt{\lambda_s}\, D_J^{-1/2} v_s, \]

where $D_J$ ($J \times J$) is a diagonal matrix of all the column masses, that is, all the $D^t_J$.

One of the aims of the joint analysis of several data tables is to compare them
through the points corresponding to the same row in the different tables. These points
will be called partial rows and denoted by $i^t$.

The projection on the $s$-th axis of each partial row is denoted by $f^t_{is}$ and the vector
of projections of all the partial rows for table $t$ is denoted by $f^t_s$:

\[ f^t_s = (D^t_I)^{-1/2} \left[ 0 \; \dots \; \sqrt{\alpha_t}\, X^t \; \dots \; 0 \right] v_s. \]

Especially when the number of tables is large, comparison of partial rows is
complicated. Therefore each partial row will be compared with the (overall) row,
projected as

\[ f_s = D_w^{-1} \left[ \sqrt{\alpha_1}\, X^1 \; \dots \; \sqrt{\alpha_T}\, X^T \right] v_s = D_w^{-1} X v_s, \]

where $D_w$ is the diagonal matrix whose general term is $\sum_{t \in T} \sqrt{p^t_{i.}}$. The choice of this
matrix $D_w$ allows us to expand the projections of the (overall) rows to keep them inside the
corresponding set of projections of partial rows, and is appropriate when the partial
rows have different weights in the tables. With this weighting the projections of the
overall and partial rows are related as follows:

\[ f_{is} = \sum_{t \in T} \frac{\sqrt{p^t_{i.}}}{\sum_{t' \in T} \sqrt{p^{t'}_{i.}}}\, f^t_{is}. \]

So the projection of a row is a weighted average of the projections of partial rows. It
is closer to those partial rows that are more similar to the overall row in terms of the
relation expressed by the axis and have a greater weight than the rest of the partial
rows. The dispersal of the projections of the partial rows with regard to the projection
of their (overall) row indicates discrepancies between the same row in the different
tables.

Notice that if $p^t_{i.}$ is equal in all the tables then $f_{is} = \frac{1}{T} \sum_{t \in T} f^t_{is}$, that is, the
overall row is projected as the average of the projections of the partial rows.
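The second stage can be sketched end to end on two small tables. This is a minimal sketch, assuming NumPy and made-up tables sharing the same rows; the function names are hypothetical, and the weighting matrix $D_w$ is taken with general term $\sum_t \sqrt{p^t_{i.}}$, which is the reading of the formulas assumed here.

```python
import numpy as np

def std_residuals(N_t):
    """Matrix X^t of CA for one table, plus its row and column masses."""
    P = N_t / N_t.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    return (P - np.outer(r, c)) / np.sqrt(np.outer(r, c)), r, c

def simultaneous_analysis(tables):
    """Weight each table by 1/lambda^t_1, concatenate, SVD, project rows and columns."""
    blocks, row_masses, col_masses = [], [], []
    for N_t in tables:
        X_t, r_t, c_t = std_residuals(N_t)
        alpha_t = 1.0 / np.linalg.svd(X_t, compute_uv=False)[0] ** 2
        blocks.append(np.sqrt(alpha_t) * X_t)
        row_masses.append(r_t)
        col_masses.append(c_t)
    X = np.hstack(blocks)
    _, sv, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T
    # column coordinates g_s = sqrt(lambda_s) D_J^{-1/2} v_s  (sv = sqrt(lambda))
    g = sv * V / np.sqrt(np.concatenate(col_masses))[:, None]
    # overall rows f_s = D_w^{-1} X v_s, with w_i = sum_t sqrt(p^t_i.)
    w = np.sum([np.sqrt(r) for r in row_masses], axis=0)
    f = (X @ V) / w[:, None]
    # partial rows f^t_s = (D_I^t)^{-1/2} [0 ... sqrt(alpha_t) X^t ... 0] v_s
    partial, start = [], 0
    for B_t, r_t in zip(blocks, row_masses):
        stop = start + B_t.shape[1]
        partial.append((B_t @ V[start:stop]) / np.sqrt(r_t)[:, None])
        start = stop
    return sv ** 2, g, f, partial, row_masses

# Made-up tables with the same 4 rows, for illustration only.
tables = [np.array([[20.0, 5, 3], [4, 18, 6], [3, 5, 22], [2, 3, 5]]),
          np.array([[10.0, 8, 2, 4], [6, 12, 5, 3], [2, 4, 15, 6], [5, 3, 4, 9]])]
lam, g, f, partial, row_masses = simultaneous_analysis(tables)
```

With these definitions each overall row coordinate is the weighted average of the corresponding partial row coordinates, with weights proportional to $\sqrt{p^t_{i.}}$.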



Interpretation rules for simultaneous analysis 



In SA the transition relations between projections of different points create a
simultaneous representation that provides more detailed knowledge of the matter being
studied.

Relation between $f^t_{is}$ and $g_{js}$: The projection of a partial row on axis $s$ depends
on the projections of the columns:

\[ f^t_{is} = \frac{\sqrt{\alpha_t}}{\sqrt{\lambda_s}} \sum_{j \in J_t} \frac{p^t_{ij}}{p^t_{i.}}\, g_{js}. \]

222 Amaya Zarraga and Beatriz Goitisolo 



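Except for the factor $\sqrt{\alpha_t / \lambda_s}$, the projection of a partial row on axis $s$ is, as in CA,
the centroid of the projections of the columns of table $t$.

Relation between $f_{is}$ and $g_{js}$: The projection of an overall row on axis $s$ may be
expressed in terms of the projections of the columns as follows:

\[ f_{is} = \sum_{t \in T} \frac{\sqrt{\alpha_t}}{\sqrt{\lambda_s}}\, \frac{\sqrt{p^t_{i.}}}{\sum_{t' \in T} \sqrt{p^{t'}_{i.}}} \left[ \sum_{j \in J_t} \frac{p^t_{ij}}{p^t_{i.}}\, g_{js} \right]. \]

The projection of the row is therefore, except for the coefficients $\sqrt{\alpha_t}/\sqrt{\lambda_s}$, the
weighted average of the centroids of the projections of the columns for each table.

Relation between $g_{js}$ and $f_{is}$ or $f^t_{is}$: The projection on the axis $s$ of the column $j$
for table $t$ can be expressed in the following way:

\[ g_{js} = \frac{\sqrt{\alpha_t}}{\sqrt{\lambda_s}} \sum_{i \in I} \frac{\sum_{t' \in T} \sqrt{p^{t'}_{i.}}}{\sqrt{p^t_{i.}}} \left( \frac{p^t_{ij}}{p^t_{.j}} - p^t_{i.} \right) f_{is}. \]

This expression shows that the projection of a column is placed on the side of
the projections of the rows with which it is associated, compared to the hypothesis
of independence, and on the opposite side of the projections of those with which it is
less associated.

This projection is, according to partial rows:

\[ g_{js} = \frac{\sqrt{\alpha_t}}{\sqrt{\lambda_s}} \sum_{i \in I} \frac{1}{\sqrt{p^t_{i.}}} \left( \frac{p^t_{ij}}{p^t_{.j}} - p^t_{i.} \right) \sum_{t' \in T} \sqrt{p^{t'}_{i.}}\, f^{t'}_{is}. \]

The same aids to interpretation are available in SA as in standard factorial
analysis as regards the contribution of points to principal axes and the quality of display
of a point on axis $s$.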



2.3 Stage three: comparison of the tables: interstructure

In order to compare the different tables, SA allows us to represent each of them by
means of a point and to project them on the axes.

The coordinate of table $t$ on axis $s$, $f_{ts}$, represents the projected inertia of the table
on the axis and, therefore, indicates the importance of the table in the determination
of the axis. Thus, $f_{ts} = \sum_{j \in J_t} p^t_{.j}\, g_{js}^2 = \mathrm{Inertia}_s(t)$, where $\mathrm{Inertia}_s(t)$ represents the
projected inertia of the columns of table $t$ on the axis $s$.

Due to the weighting of the tables chosen by SA, the maximum value of this
inertia on the first axis is 1. A value of $f_{ts}$ close to 0 would indicate orthogonality
between the first axes of the separate analyses with regard to the Simultaneous
Analysis. A value of $f_{ts}$ close to 1 would indicate that the axis of the joint analysis is
approximately the same as in the separate analysis of each table. So, if all the tables
present a coordinate close to the maximum value, 1, on the first factorial axis of the
SA, the projected inertia onto it is approximately $T$, the number of tables, and this
confirms that this first direction is accurately depicting the relevant associations of
each table.
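The table coordinates can be sketched as follows. This is a minimal sketch, assuming NumPy and made-up tables sharing the same rows; the function names are hypothetical, and the weighted blocks are rebuilt here so that the example is self-contained.

```python
import numpy as np

def weighted_block(N_t):
    """Return sqrt(alpha_t) X^t and the column masses of one table."""
    P = N_t / N_t.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    X_t = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    lam1 = np.linalg.svd(X_t, compute_uv=False)[0] ** 2
    return X_t / np.sqrt(lam1), c

def table_coordinates(tables):
    """f_ts = sum over the columns of table t of p_.j g_js^2, for every axis s."""
    blocks, masses = zip(*(weighted_block(N_t) for N_t in tables))
    X = np.hstack(blocks)
    _, sv, Vt = np.linalg.svd(X, full_matrices=False)
    coords, start = [], 0
    for X_t, c in zip(blocks, masses):
        stop = start + X_t.shape[1]
        g = (sv[:, None] * Vt[:, start:stop]) / np.sqrt(c)   # g[s, j]
        coords.append((c * g ** 2).sum(axis=1))
        start = stop
    return np.array(coords), sv ** 2       # shape (T, number of axes)

# Made-up tables with the same 4 rows, for illustration only.
tables = [np.array([[20.0, 5, 3], [4, 18, 6], [3, 5, 22], [2, 3, 5]]),
          np.array([[10.0, 8, 2, 4], [6, 12, 5, 3], [2, 4, 15, 6], [5, 3, 4, 9]])]
coords, lam = table_coordinates(tables)
```

On each axis the table coordinates add up to the eigenvalue of that axis, and none of them exceeds 1.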
2.4 Relations between factors of the analyses



In SA it is also possible to calculate the following measurements of the relation 
between the factors of the different analyses. 

Relation between factors of the individual analyses: The correlation coefficient 
can be used to measure the degree of similarity between the factors of the separate 
CA of different tables. This is possible when the marginals p\ are equal. 

When p\ are not equal, Cazes (1982) proposes calculating the correlation coef- 
ficient between factors, assigning weight to the rows corresponding to the margins 
of one of the tables. Therefore, these weights, and the correlation coefficient as well, 
depend on the choice of this reference table. In consequence, we propose to solve this 
problem of the weight by extending the concept of generalized covariance (Meot and 
Leclerc (1997)) to that of generalized correlation (Zarraga and Goitisolo (2003)). 

The relation between the factors s and f of the tables t and t' respectively would 
be calculated as: 



r(f,,,f*/) = X;,-GI 



fist 




where and fi^ti are the projections on the axes 5 and s' of the separate CA of 
the tables t and t' respectively and where Z' and are the inertias associated with 
these axes. This measurement allows us to verify whether the factors of the separate 
analyses are similar and check the possible rotations that occur. 

Relation between factors of the SA and factors of the separate analyses: Like- 
wise, it is possible to calculate for each factor s of the SA, the relation with each of 
the factors s' of the separate analyses of the different tables: 



— Z)(Gl \/K (Z)rGT\/^) 



fis 



If all the tables of frequencies analysed have the same row weights $p_{i\cdot}$, this
measurement is reduced to:

$$r(f_{s't}, f_s) = \frac{\sum_{i \in I} p_{i\cdot} \, f_{is't} \, f_{is}}{\sqrt{\sum_{i \in I} p_{i\cdot} (f_{is't})^2} \; \sqrt{\sum_{i \in I} p_{i\cdot} (f_{is})^2}}$$

that is, the classical correlation coefficient between the factors of the separate
analyses and the factors of SA.



3 Application 

In this section we apply SA to data taken from an on-line survey conducted by the
Spanish Ministry of Education and Science, from January to March 2006, among Spanish
students participating in the Erasmus program at European universities.

This application presents a comparative study, according to gender, of the
relationships between the countries that Spanish students choose as the destination
for their university exchange in the Erasmus program and the scientific fields
in which they are studying.
The 15 countries that they choose as destination are Austria, Belgium, Czech
Republic, Denmark, Finland, France, Germany, Ireland, Italy, Netherlands, Norway,
Poland, Portugal, Sweden and United Kingdom. The scientific fields in which they
are studying are: Social and Legal Sciences, Engineering and Technology, Humanities,
Health Science and Experimental Science.

Therefore, we have two data tables whose rows (countries) and columns (sci- 
entific fields) correspond to the same modalities but refer to two different sets of 
individuals, depending on their gender. In these tables both the marginals and the 
grand-totals are different. This fact suggests analyzing the tables by SA since the re- 
sults of applying other methods can be affected by the above mentioned differences 
(Zarraga and Goitisolo (2002)). 

The first factorial plane of SA (Figure 1) explains nearly 60% of total inertia. In
the plane we observe that male and female students of Humanities, Health Science
and especially Engineering and Technology behave similarly in their choice
of destination country, whereas students of Social and Legal Sciences and of
Experimental Science choose different destination countries depending on their
gender.

The plane shows that students of Humanities, both male and female, choose
the United Kingdom as their destination country, followed by Ireland. The countries
chosen by students of both genders of Engineering and Technology are mainly
Austria, Sweden and Denmark. Finally, male and female students of Health
Science prefer Portugal and Finland.

The students of Experimental Science select different countries for
the exchange depending on their gender. While male students go mainly to Portugal
and the Netherlands, females go to Norway.

Students of Social and Legal Sciences also behave differently according to gender.
The Netherlands and Ireland are selected as destination countries by both males and
females, but males also go to Belgium, the United Kingdom and Italy, while females
go to Norway and Sweden.

The projection of partial rows of each table, joined by segments, allows us to
appreciate the differences between males and females in each destination country. We
will remark on only some of them.

For example, the United Kingdom is a country to which male and female students
of Humanities go in a greater proportion. Nevertheless, males
also choose the United Kingdom to carry out Social and Legal studies whereas females
do not.

Male and female students who go to Portugal agree in selecting this country
above the average for Health degrees. However, males also go to Portugal to study
Experimental Science while females prefer this country for studies of Engineering and
Technology.

Spanish students who go to Finland share the selection of this country over the
rest of the countries to study in the areas of Health and Engineering, but there are
more females in the former area and more males in the latter.
Fig. 1. Projection of columns, overall rows and partial rows. Axis label recovered from the plot: Factor 1 (36%).



On the other hand, no large differences between males and females are found in
Germany, France, Belgium and Norway, as is indicated by the close projections of
overall and partial rows.

As a conclusion of this application we can say that Simultaneous Analysis allows
us to show the common structure inside each table as well as the differences in the
structure of both tables. A more extensive application to the joint study of the inter-
and intra-structure of a larger number of contingency tables can be found in Zarraga
and Goitisolo (2006).



4 Discussion 

The joint study of several data tables has given rise to an extensive list of factorial
methods, some of which have been gathered by Cazes (2004), for both quantitative
and categorical data tables. In the correspondence analysis (CA) approach Cazes
shows the similarity between some methods in the case of proportional row margins
and shows the problem that arises in a joint analysis when the row margins are
different or not proportional.

Comments on the appropriateness of SA and a comparison with different meth- 
ods, especially with Multiple Factor Analysis for Contingency Tables (Pages and 
Becue-Bertaut (2006)), in the cases where row margins are equal, proportional and 
not proportional between the tables can be found in Zarraga and Goitisolo (2006). 





226 Amaya Zarraga and Beatriz Goitisolo 

5 Software notes 



Software for performing Simultaneous Analysis, written in S-Plus 2000 can be found 
in Goitisolo (2002). The AnSimult package for R can be obtained from the authors. 
Abstract. In frequent subgraph mining one tries to find all subgraphs that occur with a
user-specified minimum frequency in a given graph database. The basic approach is to grow
subgraphs, adding an edge and maybe a node in each step, to count the number of database graphs
containing them, and to eliminate infrequent subgraphs. The predominant method to avoid
redundant search (the same subgraph can be grown in several ways) is to define a canonical form
that uniquely identifies a graph up to isomorphism. The obvious alternative, a repository of
processed subgraphs, has received fairly little attention so far. However, if the repository is laid
out as a hash table with a carefully designed hash function, this approach is competitive with
canonical form pruning. In experiments we conducted, the repository-based approach could
sometimes outperform canonical form pruning by 15%.



1 Introduction 

Frequent subgraph mining is the task of finding all subgraphs that occur with a
user-specified minimum frequency in a given database of (attributed) graphs. Since
this problem appears in applications in biochemistry, web mining, and program flow
analysis, it has attracted a lot of attention, and several algorithms were proposed to
tackle it. Some of them rely on principles from inductive logic programming and
describe graphs by logical expressions (Finn et al. 1998). However, the vast majority
transfers techniques developed originally for frequent item set mining. Examples
include MolFea (Kramer et al. 2001), FSG (Kuramochi and Karypis 2001),
MoSS/MoFa (Borgelt and Berthold 2002), gSpan (Yan and Han 2002), Closegraph
(Yan and Han 2003), FFSM (Huan et al. 2003), and Gaston (Nijssen and Kok 2004).
A related, but slightly different approach is used in Subdue (Cook and Holder 2000).

The basic idea of these approaches is to grow subgraphs into the graphs of the 
database, adding an edge and maybe a node (if it is not already in the subgraph) in 
each step, to count the number of graphs containing each grown subgraph, and to 
eliminate infrequent subgraphs. All found frequent subgraphs are reported (or often 
only the subset of so-called closed subgraphs). 

While in frequent item set mining it is trivial to ensure that each item set is
checked only once, it is a core problem in frequent subgraph mining how to avoid
redundant search. The reason is that the same subgraph can be grown in several
ways, namely by adding the same nodes and edges in different orders. Although
multiple tests of the same subgraph do not invalidate the result of a subgraph mining
algorithm, they can be devastating for its execution time.

One of the most elegant ways to avoid redundant search is to define a canonical 
description of a (sub)graph. Combined with a specific way of growing the subgraphs, 
such a canonical description can be used to check whether a given subgraph has 
been considered in the search before. For example, Borgelt (2006) studied a family 
of such canonical forms, which comprises the special forms used in gSpan (Yan 
and Han 2002) and Closegraph (Yan and Han 2003) as well as the one underlying 
MoSS/MoFa (Borgelt and Berthold 2002). 

However, canonical form pruning is not the only way to avoid redundant search.
A simpler and much more straightforward approach is a repository of already
processed subgraphs, against which each grown subgraph is checked. Nevertheless this
approach is rarely used and has not even been properly investigated yet. To
our knowledge only two existing algorithms use a repository, namely MoSS/MoFa,
which prunes with a canonical form by default, but offers the optional use of a
repository, and Gaston (Nijssen and Kok 2004), in which a repository is used in the final
phase for general graphs, since Gaston's canonical form is restricted to trees. In order
to close this gap, this paper examines repository-based pruning and compares it to
canonical form pruning. Surprisingly enough, a repository-based approach is highly
competitive and could sometimes outperform canonical form pruning by 15%.



2 Canonical form pruning 

The core idea underlying a canonical form is to construct a code word that uniquely
identifies a graph up to isomorphism. The characters of this code word describe
the connection structure of the graph. If the graph is attributed (labeled), they also 
comprise information about edge and node attributes. While it is straightforward 
to capture the attribute information, it is less obvious how to describe the connec- 
tion structure. For this, the nodes of the graph must be numbered (more generally: 
endowed with unique labels), because we need to specify the source and the desti- 
nation node of an edge. Unfortunately, different ways of numbering the nodes of a 
graph yield different code words, because they lead to different descriptions of an 
edge (simply because the indices of source and destination node differ). In addition, 
the edges can be listed in different orders. Different possible solutions to these two 
problems give rise to different canonical forms (see Borgelt (2006) for details). 

However, given a (systematic) way of numbering the nodes of a graph and a 
sorting criterion for the edges, a canonical description is derived as follows: each 
numbering of the nodes yields a code word, which is the concatenation of the sorted 
edge descriptions. The resulting code words are sorted lexicographically. The lexico- 
graphically smallest code word is the canonical description. (It should be noted that 
the graph can be reconstructed from this code word.) 
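
The derivation above can be sketched in Python for very small graphs. This is only a brute-force illustration under stated assumptions: it tries every numbering of the nodes (not only spanning tree numberings, so it does not have the prefix property needed in the search), and the edge description layout `(source index, destination index, source label, edge label, destination label)` as well as all function names are hypothetical choices, not the encoding of any of the cited algorithms.

```python
from itertools import permutations


def code_word(node_labels, edges, numbering):
    """Code word of a graph under one numbering of its nodes.

    node_labels: list of node attributes, indexed by original node id.
    edges: dict mapping (u, v) pairs of original node ids to an edge attribute.
    numbering: tuple where numbering[u] is the new index of node u.
    """
    descriptions = []
    for (u, v), edge_label in edges.items():
        # The source of an edge is the end with the smaller new index.
        (i, a), (j, b) = sorted([(numbering[u], node_labels[u]),
                                 (numbering[v], node_labels[v])])
        descriptions.append((i, j, a, edge_label, b))
    # Concatenate the sorted edge descriptions into one code word.
    return tuple(sorted(descriptions))


def canonical_code_word(node_labels, edges):
    """Lexicographically smallest code word over all node numberings."""
    n = len(node_labels)
    return min(code_word(node_labels, edges, perm)
               for perm in permutations(range(n)))
```

Because the set of code words over all numberings is the same for two isomorphic graphs, their smallest code words coincide. The factorial number of numberings makes this sketch unusable beyond a handful of nodes.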
Canonical code words are used in the search as follows: the process of growing
subgraphs is associated with a way of building code words for them. Most naturally,
the code word of a subgraph is obtained by simply concatenating the descriptions
of its edges in the order in which they are added in the search. Since each possible
subgraph needs to be checked only once, we may choose to process it only in the
node of the search tree in which its code word (as constructed by the search) is the
canonical code word. Otherwise the subgraph (and thus the search tree rooted at it)
is pruned.

It follows that we cannot use just any possible canonical form. If extended code 
words are built by appending the next edge description to the code word of the cur- 
rent subgraph, then the canonical form must have the so-called prefix property: any 
prefix of a canonical code word must be a canonical code word itself. Since we plan 
to extend only graphs in canonical form, the prefix property is needed to ensure that 
all possible subgraphs can be reached in the search. A simple way to ensure that a 
canonical form has the prefix property is to confine oneself to spanning tree number- 
ings of the nodes of a graph. 

In a straightforward algorithm (the code words of) all possible extensions of a
subgraph are created and checked for canonical form. Extensions in canonical form
are processed further, the rest are discarded. However, canonical forms also give rise
to restrictions of the extensions of a subgraph, because for certain extensions one can
see immediately that they lead to a non-minimal code word. For the two most important
canonical forms, namely those that are based on a breadth-first (MoSS/MoFa)
and a depth-first spanning tree numbering (gSpan/Closegraph), these are (for details
see Borgelt (2006)):

• maximum source extensions 

Only nodes having an index no less than the maximum source of an edge may be 
extended (the source of an edge is the node with the smaller index). 

• rightmost path extensions 

Only the nodes on the rightmost path of the spanning tree used for numbering 
the nodes may be extended (children of a node are sorted by index). 

While reasons of space prevent us from reviewing details, restricted extensions are
important to mention here. The reason is that they can be exploited for the repository
approach as well, because they are an inexpensive way of avoiding most of
the redundancy inherent in the search. (Note, however, that they cannot rule out all
redundancy, as there are no perfect “simple rules”.)
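
The two restrictions listed above can be sketched as follows. This is a minimal Python illustration that assumes the subgraph is given by its already numbered nodes; the function names and the `parent` array representation of the spanning tree are hypothetical and do not reflect the data structures of MoSS/MoFa or gSpan.

```python
def maximum_source_extendable(edges):
    """Nodes that may be extended under maximum source extensions.

    edges: list of (i, j) node index pairs of the current subgraph.
    The source of an edge is its node with the smaller index.
    """
    max_source = max(min(i, j) for i, j in edges)
    nodes = {n for edge in edges for n in edge}
    # Only nodes with an index no less than the maximum source qualify.
    return sorted(n for n in nodes if n >= max_source)


def rightmost_path(parent):
    """Nodes on the rightmost path of a depth-first spanning tree.

    parent[k] is the tree parent of node k (None for the root); nodes are
    assumed to be numbered in depth-first order, so the last node is the
    end of the rightmost path.
    """
    node = len(parent) - 1
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]
```

For a tree with `parent = [None, 0, 1, 0, 3]`, the rightmost path is the chain from the root 0 over node 3 to node 4, and only these nodes would be extended.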



3 Repository of processed subgraphs 

A repository of processed subgraphs is the most straightforward way of avoiding 
redundant search. Every encountered frequent subgraph is stored in a data structure, 
which allows us to check quickly whether a given subgraph is contained in it or not. 
Whenever a new subgraph is created, this data structure is accessed and if it contains 
the subgraph, we know that it has already been processed and thus can be discarded. 
Only subgraphs that are not contained in the repository are extended and, of course,
inserted into the repository.
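
The principle can be illustrated with a deliberately simplified Python sketch. It assumes a single database graph and identifies a subgraph by the set of edges of its embedding, so the repository is a plain set of edge sets; this ignores isomorphic occurrences elsewhere and minimum support, and the function name is hypothetical.

```python
def grow_with_repository(graph_edges):
    """Enumerate connected edge subsets of one graph, pruning duplicates.

    graph_edges: list of (u, v) node pairs of a single database graph.
    Returns the repository of processed subgraphs and the number of
    duplicates that were discarded.
    """
    repository = set()
    duplicates = 0
    stack = [frozenset([edge]) for edge in graph_edges]
    while stack:
        sub = stack.pop()
        if sub in repository:
            # Already processed: discard without extending it again.
            duplicates += 1
            continue
        repository.add(sub)
        nodes = {n for edge in sub for n in edge}
        # Extend by every edge that touches the current subgraph.
        for edge in graph_edges:
            if edge not in sub and (edge[0] in nodes or edge[1] in nodes):
                stack.append(sub | {edge})
    return repository, duplicates
```

On a triangle, seven connected subgraphs are processed while five redundant constructions are discarded, which shows how quickly the same subgraph is reached along different growth orders.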

There are two main issues one has to address when designing such a data struc- 
ture. In the first place, we have to make sure that each subgraph is stored using a 
minimal amount of memory, because the number of processed subgraphs is usually 
huge. (This consideration may be one of the main reasons why a subgraph repository 
is so rarely used.) Secondly, we have to make the containment test as fast as possible, 
since it will be carried out frequently. 

In order to achieve the first objective, we exploit the fact that we only want to store graphs
that appear in at least one graph of the database (which usually resides in memory
anyway). Therefore we can store a subgraph by listing the edges of one embedding
(that is, one occurrence of the subgraph in a graph of the database). Note that it
suffices to list the edges, since the search is usually restricted to connected subgraphs
and thus the edges also identify all nodes.¹

It is pleasing to observe that this way of storing a subgraph can also make it
easier to check whether a given subgraph is equivalent to it (isomorphism test). The
rationale is to fix an order of the database graphs and to create the embeddings of all
subgraphs in this order. Then we do not store an arbitrary embedding, but one into
the first database graph it is contained in. For a new subgraph, for which we want
to know whether it is in the repository, we can then check whether the first database
graph containing it coincides with the one underlying the stored embedding. If it
does not, we already know that the subgraphs (the new one and the stored one to
which it is compared) cannot be equivalent, since equivalent subgraphs have the
same embeddings.

However, if the database graphs coincide, we carry out the actual isomorphism
test by also relying on the embeddings. We mark the embedding that is stored in the
repository (that is, its edges) in the containing database graph. Then we traverse all
embeddings of the new subgraph into the same graph² and check whether for any
of them all edges are marked. If such an embedding exists, the two subgraphs (the
new one and the stored one) must be equivalent, otherwise they differ. Obviously,
this isomorphism test is linear in the number of edges and thus very efficient. It
should be kept in mind, though, that it can be costly if a subgraph possesses a large
number of embeddings into the same graph, because in the worst case (that is, if
the two subgraphs are not isomorphic) all of these embeddings have to be checked.
However, our experiments showed that this is an unlikely case, since especially larger
subgraphs most of the time possess only a single embedding per database graph.
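
A minimal Python sketch of this embedding-based test is given below. It assumes that edges are represented by identifiers that are unique within a database graph and that embeddings are lists of such identifiers; the function name and argument layout are hypothetical and are not taken from the MoSS implementation.

```python
def same_subgraph(stored_graph_id, stored_edges, new_graph_id, new_embeddings):
    """Embedding-based equivalence test against one repository entry.

    stored_graph_id: first database graph containing the stored subgraph.
    stored_edges: edge identifiers of the stored embedding in that graph.
    new_graph_id: first database graph containing the new subgraph.
    new_embeddings: all embeddings of the new subgraph into that graph,
        each given as a list of edge identifiers.
    """
    # Different first containing graphs: the subgraphs cannot be equivalent.
    if stored_graph_id != new_graph_id:
        return False
    # "Mark" the edges of the stored embedding.
    marked = set(stored_edges)
    # Equivalent if some embedding of the new subgraph uses only marked edges.
    return any(len(emb) == len(marked) and all(e in marked for e in emb)
               for emb in new_embeddings)
```

Each embedding is checked in time linear in its number of edges, but in the negative case every embedding of the new subgraph is visited, which matches the worst case described above.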

Even though an isomorphism test of the described form is fairly efficient, one
should try to avoid it. Apart from the obvious checks whether the number of nodes
and edges, the support in the graph database and the number of embeddings coincide
(naturally these must all be equal for isomorphic subgraphs), we employ a hash
function that is computed from local graph properties. The basic idea is to combine
the node and edge attributes and the node degrees, hoping that this allows us
to distinguish non-isomorphic subgraphs. In particular, we combine for each edge
the edge attribute and the attribute and degree of the two incident nodes into a number.
For each node we compute a number from the node attribute, the node degree,
the attributes of its incident edges and the attributes of the other nodes these edges
are incident to. These numbers (one for each node and one for each edge) are then
combined with the total numbers of nodes and edges to yield a hash code.³

¹ The only exception are subgraphs consisting of a single node. Fortunately, such subgraphs
need not be stored, since they cannot be created in more than one way, thus making it
unnecessary to check whether they have been processed before.

² This is straightforward in our implementation, since in order to facilitate and accelerate
forming extensions, we keep a list of all embeddings of a subgraph.

The computed hash code is used in the standard way to build a hash table, thus 
making it possible to restrict the isomorphism test to (a subset of) the subgraphs in 
one hash bin (a subset, because some collisions can be resolved by comparing the 
support etc., see above). By carefully tuning the parameters of the hash function we 
tried to minimize the number of collisions. 
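
A hash function of this kind might be sketched in Python as follows. The sketch assumes integer node and edge attributes; all multipliers, shift widths and the function name are arbitrary, hypothetical choices made for illustration and are not the tuned parameters of the actual implementation. Only commutative combinations (sum and exclusive or) are used, so that the result does not depend on the numbering of the nodes or the order of the edges.

```python
MASK = 0xFFFFFFFF  # keep all values within the range of 32 bit integers


def subgraph_hash(node_attrs, edges):
    """Hash code of a subgraph computed from local graph properties.

    node_attrs: list of integer node attributes, indexed by node id.
    edges: list of (u, v, attr) triples with an integer edge attribute.
    """
    degree = [0] * len(node_attrs)
    for u, v, _ in edges:
        degree[u] += 1
        degree[v] += 1

    total, mixed = 0, 0
    # One number per edge: edge attribute plus attribute and degree of both ends.
    for u, v, attr in edges:
        a, b = sorted([(node_attrs[u], degree[u]), (node_attrs[v], degree[v])])
        value = (attr * 31 + a[0] * 7 + a[1]) * 131 + b[0] * 7 + b[1]
        total = (total + value) & MASK
        mixed ^= (value << (value % 13)) & MASK  # shift of varying width

    # One number per node: own attribute and degree, incident edge attributes
    # and the attributes of the neighboring nodes.
    node_value = [node_attrs[n] * 17 + degree[n] for n in range(len(node_attrs))]
    for u, v, attr in edges:
        node_value[u] += attr * 3 + node_attrs[v]
        node_value[v] += attr * 3 + node_attrs[u]
    for value in node_value:
        total = (total + value) & MASK
        mixed ^= (value << (value % 11)) & MASK

    # Finally mix in the total numbers of nodes and edges.
    return (total ^ mixed ^ (len(node_attrs) << 16) ^ len(edges)) & MASK
```

The resulting value can be used as a dictionary or hash table key; subgraphs that fall into the same bin still have to be compared with the cheap checks and, if these fail to separate them, with the isomorphism test.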



4 Comparison 

Considering how canonical form pruning and repository-based pruning work, we 
can make the following observations, which already give hints w.r.t. their relative 
performance (and which we use to explain our experimental findings): 

Canonical form pruning has the advantage that we only have to carry out one test
(for canonical form) in order to determine whether a subgraph needs to be processed
or not (even though this test can be expensive). It has the disadvantage that it is most
costly for the subgraphs that are in canonical form (and thus have to be processed),
because for these subgraphs all possibilities to construct a code word have to be tried.
For non-canonical code words the test usually terminates earlier, since it can often
construct fairly quickly a prefix that is smaller than the code word of the subgraph to
test.

Repository-based pruning has the advantage that it often allows us to decide very
quickly that a subgraph has not been processed yet (for example, if a hash bin is
empty). Together with comparing the numbers of nodes and edges, the support etc.,
this suggests that a repository-based approach is fastest for subgraphs that actually
have to be processed. Only if these simple tests fail (as for equivalent subgraphs) do we
have to carry out the isomorphism test.

As a consequence, we expect repository-based pruning to perform well if the 
number of subgraphs to be processed is large compared to the number of subgraphs 
to be discarded (as the repository is usually faster for the former). 



³ A technical remark: we do not only combine these numbers by summing them and
computing their bitwise exclusive or, but also apply bitwise shifts of varying width in order to
cover the full range of values of (32 bit) integer numbers.
Fig. 1. Experimental results on the IC93 data set, search time vs. minimum support in percent.
Left: maximum source extensions, right: rightmost path extensions.



5 Experiments 

In order to test our repository-based pruning experimentally, we implemented it as
part of the MoSS program⁴, which is written in Java. As a test dataset (to which we
confine ourselves here due to limits of space) we used a subset of the Index Chemicus
from 1993. The results we obtained with different restricted extensions (maximum
source and rightmost path, see Section 2) are shown in Figures 1 to 3. The horizontal
axis shows the minimal support in percent.

Figure 1 shows the execution times in seconds. The upper graph refers to canonical
form pruning, the lower to repository-based pruning. The times do not differ
much, but diverge for lower support values, reaching a 15% advantage for the
repository-based approach together with maximum source extensions.

Figure 2 shows the numbers of subgraphs considered in the search and provides 
a basis for explanations of the observed behavior. The graphs refer (from top to bot- 
tom) to the number of generated subgraphs, the number checked for duplicates, the 
number of processed subgraphs, and the number of (discarded) duplicates (difference 
between the two preceding curves). 

Note that about half of the work is done by minimum support pruning (which 
discards all subgraphs that do not appear in the user-specified minimum number of 
database graphs), as it is responsible for the difference between the two top curves. 
The subgraphs discarded in this way may be unique or not — we need not care, since 
they do not qualify anyway. 

Canonical form pruning and repository-based pruning only serve to get rid of
the subgraphs between the two middle curves. That the gap between them is fairly
small compared to their vertical location indicates the high quality of restricted
extensions: most redundancy is already removed by them and only fairly few redundant
subgraphs still need to be detected. (Note that the gap is smaller for maximum source
extensions, which is the main reason for the usually lower execution times achieved
by this approach.)

⁴ MoSS is available for download under the GNU Lesser (Library) General Public License at
http://www.borgelt.net/moss.html.
Fig. 2. Experimental results on the IC93 data set, numbers of subgraphs used in the search.
Left: maximum source extensions, right: rightmost path extensions.




Fig. 3. Experimental results on the IC93 data set, performance of repository-based pruning. 
Left: maximum source extensions, right: rightmost path extensions. 



Figure 3 finally shows the performance of repository-based pruning (mainly the 
effectiveness of the hash function). All curves are the same as in the preceding fig- 
ure, with the exception of the third curve from the top, which shows the number of 
isomorphism tests. Subgraphs in the gap between this curve and the one above it 
have to be processed and are identified as such without any isomorphism test. Only 
subgraphs in the (small) gap between this curve and the bottom curve (the number of 
actual duplicates) have to be identified and discarded with the help of isomorphism 
tests. 

Note that for a perfect hash function (which maps only equivalent subgraphs to 
the same value) the two bottom curves would coincide. Note also that a canonical 
form can be seen as a perfect hash function (with a range of values that does not fit 
into an integer), since it uniquely identifies a graph. 



6 Summary 

In this paper we investigated the widely neglected possibility to avoid redundant
search in frequent subgraph mining with a repository of already encountered
subgraphs. Even though it may be less elegant than the more popular approach of
canonical forms and, of course, requires additional memory for storing the subgraphs, it
should not be dismissed too easily. If the repository is designed carefully, namely as
a hash table with a hash function computed from local graph properties, it is highly
competitive with a canonical form approach. In our experiments we observed
execution times that were up to 15% lower for the repository-based approach than for
canonical form pruning, while the additional memory requirements were bearable.
Abstract. Watermarks in paper have been in use in Medieval Europe since 1282.
Watermarks can be understood as an ancient form of a copyright signature.
The interest of the International Association of Paper Historians (IPH) lies specifically in the
categorical determination of similar ancient watermark signatures.

The highly complex structure of watermarks can be regarded as a strong and discriminative
property. Therefore we introduce edge-based features that are incorporated for retrieval
and classification. The feature extraction method is capable of representing the global structure
of the watermarks, as well as local perceptual groups and their connectivity. The advantage of
the method is its invariance against changes in illumination and similarity transformations.

The classification results have been obtained with leave-one-out tests and a support vector
machine (SVM) with an intersection kernel. The best retrieval results have been obtained
with the histogram intersection similarity measure. For the 14 class problem we obtain a true
positive rate of more than 87%, which is better than any earlier attempt.



1 Introduction 

Ancient watermarks served as a mark for the paper mill that made the sheet. Hence,
they served as a unique identifier and as a quality label. Nowadays, scientists from
the International Association of Paper Historians (IPH) try to identify unique
watermarks in order to learn about the evolution of commercial and cultural exchanges
between cities in the Middle Ages (IPH 1998). The work is tedious since there are
approximately 600,000 known watermarks and their number is steadily growing.

In this paper we present a structure-based feature approach in order to automatically
retrieve and classify ancient watermarks. In the following we show that structure
is a well suited feature to discriminate ancient watermarks.

Next, we present relevant work that is followed by a section on the actual feature
computation. In the second part of this article we show the most important results. We
summarize our contribution with a discussion of the results and a final conclusion.




238 



Gerd Brunner and Hans Burkhardt 



1.1 Related work 

To date, there have been attempts to classify and retrieve watermark images, both by
textual and content-based approaches. Textual approaches have been developed by
Del Marmol (1987) and Briquet (1923). As a matter of fact, purely textual classification
systems can be error prone. Watermark labels and/or textual descriptions might be
very old, erroneous or just not detailed enough. Therefore, more recent attempts have
been undertaken in order to focus on the real content of watermark images. In Rauber
et al. (1997) the authors used a 16-bin large circular histogram computed around the 
center of gravity of each watermark image. In addition, eight directional filters were 
applied to each image and used as a feature vector. The algorithms were tested on a 
small watermark database consisting of 120 images, split up into 12 different classes. 
The system achieved a probability of 86% that the first retrieved image belongs to 
the same class as the query image. A different approach was taken by the authors in 
Riley and Eakins (2002) who used three sets of various global moment features and 
three sets of component-based features. The latter set of features consists of several 
shape descriptors which are extracted from various image regions. 

In the following we will show that the structure of watermarks can be most effi- 
ciently represented by features taken from a set of straight line segments. Therefore, 
we will extract sets of segments and compute features from them on different scales. 



2 Feature extraction 

The geometric structure of watermarks is a strong descriptor. Therefore, we compute 
a hierarchy of structural features, namely global and local ones. The former ones 
depict a holistic scene representation and the latter ones take local perceptual groups 
and their connectivity into account. As mentioned earlier we represent the structure 
of the watermarks by straight line segments. In order to extract the line segments 
we have adopted the algorithms of Pope and Lowe (1994) and Kovesi (2002). In 
the first step we create an edge map with the Canny detector. Next, the algorithm 
scans through the binary edge map, where the neighborhood of every edge pixel is 
investigated in order to form line segments. The final segments serve as a ground 
truth for the further feature computation. 

Global Features 

Let $L = \{\mathbf{l}_i \mid i = 1, 2, \ldots, N\}$ be a set of line segments obtained from a watermark image.
Then, we compute geometric properties of $L$ such as the angles of all segments
between each other, the relative lengths of every segment and the relative Euclidean
distance between all segment mid-points.

In detail, the angle between two segments $\mathbf{l}_i$ and $\mathbf{l}_j$ is defined as:




Classification and Retrieval of Ancient Watermarks 



239 



with $\|\cdot\|_2$ being the L2 norm. The angle is in the range $[-\pi, \pi]$. The relative
length of a segment $\mathbf{l}_i$ can be written as:



$$len(\mathbf{l}_i) = \frac{\sqrt{(x_i^e - x_i^b)^2 + (y_i^e - y_i^b)^2}}{\sqrt{(x_{max} - x_0)^2 + (y_{max} - y_0)^2}} \qquad (2)$$

where $x_i^b$, $x_i^e$, $y_i^b$ and $y_i^e$ denote the coordinates of the segment's begin and end points.
The denominator is a scaling factor with respect to the longest possible line segment*
with $(x_0, y_0)$ and $(x_{max}, y_{max})$ as the begin and end point coordinates. The Euclidean
distance between the mid-points $p_i$ and $p_j$ of the segments $\mathbf{l}_i$ and $\mathbf{l}_j$ is defined as

$$d(p_i, p_j) = \frac{\sqrt{(x_i^m - x_j^m)^2 + (y_i^m - y_j^m)^2}}{\sqrt{(x_{max} - x_0)^2 + (y_{max} - y_0)^2}} \qquad (3)$$

with $x_i^m$, $x_j^m$, $y_i^m$ and $y_j^m$ as the coordinates of the segment mid-points. The denominator
fulfills the same scaling purpose as the one in Equation 2. Thus, the relative length
of a segment and the relative distance between two segments are limited to the range
[0, 1]. The relative representation ensures invariance under isotropic scaling.
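
As a small worked illustration, the relative length and the relative mid-point distance could be computed as in the following Python sketch. It assumes that a segment is given as a pair of end points and that the longest possible segment runs from $(0, 0)$ to $(x_{max}, y_{max})$; the function names are hypothetical.

```python
from math import hypot


def relative_length(segment, x_max, y_max):
    """Length of a segment divided by the length of the image diagonal.

    segment: ((x_begin, y_begin), (x_end, y_end)).
    x_max, y_max: end point of the longest possible segment, assuming it
    starts at the origin (0, 0).
    """
    (xb, yb), (xe, ye) = segment
    return hypot(xe - xb, ye - yb) / hypot(x_max, y_max)


def relative_midpoint_distance(seg_a, seg_b, x_max, y_max):
    """Distance between two segment mid-points, scaled by the diagonal."""
    def midpoint(segment):
        (xb, yb), (xe, ye) = segment
        return (xb + xe) / 2, (yb + ye) / 2

    (xa, ya), (xc, yc) = midpoint(seg_a), midpoint(seg_b)
    return hypot(xa - xc, ya - yc) / hypot(x_max, y_max)
```

Scaling the image and all segments by the same factor multiplies numerator and denominator alike, which is why both quantities stay unchanged under isotropic scaling.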

Now that the three basic properties of a set of line segments are computed, we
can incorporate this information into Euclidean distance matrices (EDM). An EDM
is a two-dimensional array consisting of distances taken from a set of entities, which
can be coordinates or points from a feature space. Thus, an EDM incorporates distance
knowledge. For our feature computation, EDMs are used in order to represent
the relative geometric connectivity of a set of straight line segments. Specifically,
we define three EDMs: one based on segment angles (see Equation 1), a second
one based on relative segment lengths (see Equation 2) and a third one based on
relative distances between segments (see Equation 3). The matrix of the angle-based
EDM can be written as:

and each element is computed according to

(5)



where the values of $\theta_i$ and $\theta_j$ are in the range $[-\pi, \pi]$. The angles are taken between
the line segments $i$ and $j$. The EDMs based on relative lengths and relative distances
can be represented in a similar fashion.

Next, we compute three histograms from the previously created EDMs. The
histogramming step is necessary since the size of the EDMs can differ, i.e. the number of
line segments is not the same for each watermark. The three histograms can be
understood as a holistic representation of a set of segments. The final concatenation of
the three histograms represents a global feature and is invariant against similarity
transformations.

* The longest possible line segment is as long as the diagonal of the image.
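
The length and distance computations described above can be illustrated with a short sketch. It is not the authors' implementation: it assumes that the image origin is (0, 0) and that (x_max, y_max) equals the image width and height, it only builds the mid-point distance EDM (Equation 3), and it histograms the upper triangle of the matrix and normalises the counts to unit sum, both of which are choices made here for illustration. All function names are hypothetical.

```python
import math

def relative_length(seg, width, height):
    """Equation 2: Euclidean segment length divided by the image diagonal."""
    (xb, yb), (xe, ye) = seg
    return math.hypot(xe - xb, ye - yb) / math.hypot(width, height)

def midpoint(seg):
    """Mid-point of a segment given as ((xb, yb), (xe, ye))."""
    (xb, yb), (xe, ye) = seg
    return ((xb + xe) / 2.0, (yb + ye) / 2.0)

def distance_edm(segments, width, height):
    """Matrix of relative mid-point distances; every entry lies in [0, 1]."""
    diag = math.hypot(width, height)
    mids = [midpoint(s) for s in segments]
    return [[math.hypot(p[0] - q[0], p[1] - q[1]) / diag for q in mids]
            for p in mids]

def upper_triangle(matrix):
    """Entries above the diagonal (assumption: each pair is counted once)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def histogram(values, bins=15, lo=0.0, hi=1.0):
    """Fixed-size histogram, normalised to unit sum (assumption), so that
    watermarks with different numbers of segments become comparable."""
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins

# Three hypothetical segments in a 30 x 40 image (diagonal 50).
segments = [((0, 0), (3, 4)), ((0, 0), (6, 8)), ((10, 0), (10, 10))]
edm = distance_edm(segments, width=30, height=40)
feature = histogram(upper_triangle(edm))
```

Because every quantity is divided by the image diagonal, scaling the segments and the image by the same factor leaves `edm` and `feature` unchanged, which is the isotropic scaling invariance mentioned in the text.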

Local features 

The previously developed global features encode a complete watermark. However, 
local structural information plays an important role, too. Watermarks commonly ex- 
hibit certain local regularities in their structure. In order to tackle this problem we 
introduce local features that are based on perceptual groups of line segments. 

Therefore, we define subsets of line segments from every watermark which are 
unique, eminent structural entities with well defined relations: Parallelity, Perpen- 
dicularity, Diagonality (",^). These groups are formed according to angular re- 
lations between segments and will be used in order to compute geometric relations 
between their members. 

The four subsets reflect line segments with certain relations. In fact, we will
extract similar features as we did in the global case. Following that methodology, we
can compute three EDMs, E_*^{ang}, E_*^{len} and E_*^{dist}, for each of the four extracted
sets of segments. Note that the * is a placeholder for the four sets. Specifically, we define
the angles between two segments, the relative segment lengths and the relative distance
between two segments according to Equations 1, 2 and 3 for every subset of line
segments.

Then we create three histograms for every subset of line segments. The his- 
tograms represent geometric relations of perceptual segment subsets. Since three 
histograms have been formed for every set, we obtain 12 histograms in total. The 
final set of local feature vectors is obtained by concatenation of all 12 histograms. 

Feature representation 

In our experiments we have empirically determined the best resolution for the
histograms. For the angle-based histograms² we use 36 bins, which corresponds
to a 10° resolution with respect to angles. The resolution of every length-based
histogram³ is 15 bins, which results in a robust and compact feature. The final
feature vector is obtained by the concatenation of all global and local feature
histograms.

² Histograms that are computed from the following EDMs: E^{ang} (global features) and
E_*^{ang} (local features).

³ Histograms that are computed from the following EDMs: E^{len}, E^{dist} (global features) and
E_*^{len}, E_*^{dist} (local features).



3 Results

3.1 Data description

The Swiss Paper Museum in Basel provided us with a subset of their digital watermark
database. The database used in the subsequent experiments consists of about 1800
images, split up into 14 classes: Eagle, Anchor1, Anchor2, Coat of Arm, Circle, Bell,
Heart, Column, Hand, Sun, Bull Head, Flower, Cup and Other objects. The class
memberships are according to the Briquet catalog (Briquet 1923). Figure 1 shows
scanned sample watermark images. A detailed description of the scanning setup can
be found in Rauber (1998). In fact, the watermarks are digitized from the original
sources. Specifically, each ancient document was scanned three times (front, back
and by transparency) in order to obtain a high-quality digital copy, where the last
scan contains all necessary information (Rauber 1998). A semi-automatic method,
which is described in Rauber (1998), delivers the final images. The method incorporates
a global contrast and contour enhancement as well as a grey-level inversion. Figure 2
shows sample images after the method was applied.




Fig. 1. Samples of scanned ancient watermark images (courtesy Swiss Paper Museum, Basel). 



3.2 Ancient Watermark Retrieval 

For retrieval we have computed the features offline for all watermarks. At retrieval 
time, only the feature vector for the query watermark has to be computed. The re- 
trieval results are obtained with the histogram intersection similarity measure. 
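
As a minimal sketch of the retrieval step, the histogram intersection of two feature vectors can be computed as the sum of their bin-wise minima. The example assumes feature vectors normalised to unit sum, so that an identical match scores 1; the database contents and the names `retrieve`, `wm_a`, `wm_b` and `wm_c` are purely illustrative and not taken from the paper.

```python
def histogram_intersection(h1, h2):
    """Similarity of two histograms: sum of the bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def retrieve(query, database, top_k=10):
    """Rank the stored feature vectors by decreasing similarity to the query."""
    scored = [(histogram_intersection(query, feat), name)
              for name, feat in database.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(name, score) for score, name in scored[:top_k]]

# Features are computed offline for the whole database (toy values here).
database = {
    "wm_a": [0.5, 0.25, 0.25],
    "wm_b": [0.25, 0.5, 0.25],
    "wm_c": [0.0, 0.0, 1.0],
}
# At retrieval time only the query feature vector has to be computed.
ranking = retrieve([0.5, 0.25, 0.25], database)
```

With these toy values the identical entry `wm_a` is ranked first with similarity 1.0, followed by `wm_b` (0.75) and `wm_c` (0.25).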

Figure 3 shows a set of 10 watermark images. The first image is the query, the
second one is the identical match, indicated by the 1 above the image. The
subsequent images are sorted by decreasing similarity, as indicated by the numbers
above each image. It is interesting to observe that most of the retrieved anchors show
the same orientation. A closer look at the query image reveals that it features
a tiny cross atop and cusp-like structures at the outer endings⁴. The retrieved
images clearly show that both of these small-scale structures are present in all of
the displayed images. In Figure 4 we can see another retrieval result. Table 1 shows
the averaged class-wise precision and recall at N/2, where N is the number of class



⁴ Note that the class Anchor1 possesses a large intra-class variation of shapes, i.e. many
anchors have no crosses or show very different endings.

Fig. 2. Sample filigrees from the watermark database after enhancement and binarization (see
Rauber 1998). Each of the two rows shows watermarks from the same class, namely Heart
and Eagle. The samples show the large intra-class variability of the watermark database.

Fig. 3. Retrieval result obtained with our structure-based features from the class Anchor1 of
the watermark database.



Table 1. Averaged precision and recall at N/2 for the watermark database.

Classes  1     2     3     4     5     6     7     8     9     10    11    12    13    14
N        322   115   139   71    91    44    197   126   99    33    14    31    17    416
P(N/2)   .492  .243  .214  .144  .109  .244  .173  .097  .442  .068  .190  .802  .556  .283
R(N/2)   .528  .139  .302  .197  .088  .182  .152  .191  .263  .061  .143  .710  .352  .361



Fig. 4. Retrieval result of the class Circle from the watermark database, using
global and local structural features.



members. Due to space limitations the watermark classes have been assigned a
number⁵, where 1 refers to the class Eagle and 14 to the class Other objects. However,
we do observe some classes with worse performance. This is to a large extent due to the
high intra-class variation of the database. Figure 2 shows the large intra-class
variation for two sample classes. Since CBIR performs a similarity ranking, some class
members can be less similar to a certain query (from the same class) than images
from other classes. Visual inspection has shown that this argument holds for
the classes Eagle and Coat of Arm. The reason is that eagle motifs are very
common in heraldry, i.e. about half of the members of the class Coat of Arm have some
kind of eagle embedded on a shield or armorial bearings. Similar observations hold
for some other classes.

3.3 Ancient Watermark Classification 

In the previous section we have retrieved watermark images. Now we want to learn
the feature distribution of every class in the feature space. Therefore, the
classification of the watermark images is treated as a learning problem. The classification
results are obtained with leave-one-out tests and SVMs using different
kernels. Specifically, we have obtained the best results with the intersection kernel and
a cost parameter C = 2^^. We have used the same features as for the retrieval task.
The feature vectors have been normalized to zero mean and unit variance.
Table 2 shows the class-wise true and false positive rates which have been obtained



Table 2. Class-wise true positive (TP) and false positive (FP) rates for the watermark
database.

Classes  1     2     3     4     5     6     7     8     9     10    11    12    13    14    Total
TP       .919  .870  .871  .465  .758  .773  .817  .865  .919  .546  .571  1.00  .824  .995  .874
FP       .037  .001  .019  .012  .011  .003  .025  .008  .002  .004  .001  0     0     .008  .125



⁵ The class names are listed in Section 3.1.

with a leave-one-out test. We can see that a high recognition rate is achieved for
most of the classes. In total, an 87.41% true positive rate is achieved.



4 Conclusion 

The retrieval and classification of watermark images is of great importance for paper 
historians. Therefore we have developed a structure-based feature extraction method 
that encodes relative spatial arrangements of line segments. The method determines 
relations on global and local scales. The results show that structure is a powerful de- 
scriptor for the current problem. The retrieval results show that the proposed features 
work very well. 

Next, we have performed a classification of the watermark images. A support 
vector machine with intersection kernel was able to successfully learn the character- 
istics of every class. A classification rate (true positive rate) of more than 87% is an 
indicator of good performance. In future work, we would like to apply the
structural features to a larger database of watermarks and investigate partial matching as
well.

Abstract. Supervised classification methods require reliable and consistent training sets. In
image analysis, where class labels are often assigned to the entire image, the manual
generation of pixel-accurate class labels is tedious and time consuming. We present an independent
component analysis (ICA)-based method to generate these pixel-accurate class labels with
minimal user interaction. The algorithm is applied to the detection of skin cancer in
hyper-spectral images. Using this approach it is possible to remove artifacts caused by sub-optimal
image acquisition. We report on the classification results obtained for the hyper-spectral skin
cancer data set with 300 images using support vector machines (SVM) and model-based
discriminant analysis (MclustDA, MDA).



1 Introduction 

Hyper-spectral images consist of several, up to several hundred, images acquired at different
- mostly narrow-band and contiguous - wavelengths. Thus, a hyper-spectral image
contains pixels represented as multidimensional vectors with elements indicating the
reflectivity at a specific wavelength. For a contiguous set of narrow-band wavelengths
these vectors correspond to spectra in the physical sense and are equal to spectra
measured with, e.g., spectrometers.

Supervised classification of hyper-spectral images requires a reliable and consistent
training set. In many applications labels are assigned to the full image instead of to
each individual pixel even if instances of all the classes occur in the image. To obtain
a reliable training set it may be necessary to label the images on a pixel-by-pixel basis.
Manually generating pixel-accurate class labels requires a lot of effort; cluster-based
automatic segmentation is often sensitive to measurement errors and illumination
problems. In the following we present a labelling strategy for hyper-spectral skin
cancer data that uses PCA, ICA and K-Means clustering. For the classification of
unknown images, we compare support vector machines and model-based
discriminant analysis.

Section 2 describes the methods that are used for the labelling approach. The
classification algorithms are discussed in Section 3. In Section 4 we present the
segmentation and classification results obtained for the skin cancer data set and Section 5 is
devoted to discussions and conclusions.



2 Labelling 

Hyper-spectral data are highly correlated and contain noise, which adversely affects
classification and clustering algorithms. As the dimensionality of the data equals the
number of spectral bands, using the full spectral information leads to high computational
complexity. To overcome the curse of dimensionality we use PCA to reduce the
dimension of the data and, inherently, also unwanted noise. Since different features of
the image may have equal score values for the same principal component, an
additional feature extraction step is proposed. ICA makes it possible to detect acquisition
artifacts like saturated pixels and inhomogeneous illumination. Those effects can be
significantly reduced in the spectral information, giving rise to an improved
segmentation.

2.1 Principal Component Analysis (PCA) 

PCA is a standard method for dimension reduction and can be performed by sin- 
gular value decomposition. The algorithm gives uncorrelated principal components. 
We assume that those principal components that correspond to very low eigenvalues 
contribute only to noise. As a rule of thumb, we chose to retain at least 95% of the 
variability which led to selecting 6-12 components. 
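
The 95% rule of thumb can be sketched as follows: given the eigenvalues of the covariance matrix (the variances of the principal components), keep the smallest number of leading components whose cumulative share of the total variance reaches the threshold. The function name and the example eigenvalues are illustrative assumptions, not values from the study.

```python
def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of leading principal components whose cumulative
    variance share is at least `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)

# Hypothetical spectrum: the shares are 60%, 30%, 6%, 3% and 1%.
n_components = components_for_variance([60, 30, 6, 3, 1])
```

For this toy spectrum two components explain 90% and three explain 96%, so three components are retained.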

2.2 Independent Component Analysis (ICA) 

ICA is a powerful statistical tool to determine hidden factors of multivariate data. The 
ICA model assumes that the observed data, x, can be expressed as a linear mixture 
of statistically independent components, s. The model can be written as 

x = As 

where the unknown matrix A is called the mixing matrix. Defining W as the unmixing 
matrix we can calculate s as 

s = Wx. 

As we have already performed a dimension reduction, we can assume that noise is
negligible and that A is square, which implies W = A^{-1}. This significantly simplifies the
estimation of A and s. Provided that no more than one independent component has a
Gaussian distribution, the model can be uniquely estimated up to scalar multipliers.
There exists a variety of different algorithms for fitting the ICA model. In our work
we focused on the two most popular implementations, which are based on
maximisation of non-Gaussianity and minimisation of mutual information, respectively:
FastICA and FlexICA.

FastICA

The FastICA algorithm developed by Hyvärinen et al. (2001) uses negentropy, J(y),
as a measure of non-Gaussianity. Since negentropy is zero for Gaussian variables and
always nonnegative, one has to maximise negentropy in order to maximise
non-Gaussianity. To avoid computational problems the algorithm uses an approximation
of negentropy: if G denotes a nonquadratic function and we want to estimate one
independent component s, we can approximate

\[
J(y) \propto \left[ E\{G(y)\} - E\{G(v)\} \right]^2,
\]

where v is a standardised Gaussian variable and y is an estimate of s. We use
G(y) = log cosh y since this has been shown to be a good choice. Maximisation
directly leads to a fixed-point iteration algorithm that is 20-50 times faster than other
ICA implementations. To estimate several independent components a deflationary
orthogonalisation method is used.
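
To make the negentropy approximation concrete, the following sketch estimates [E{G(y)} - E{G(v)}]^2 with G(y) = log cosh y from samples. It only evaluates the contrast; it is not the FastICA fixed-point iteration. Estimating E{G(v)} by Monte Carlo from a Gaussian reference sample, the sample sizes and the seed are assumptions made here for illustration.

```python
import math
import random

def log_cosh(u):
    """Numerically stable log(cosh(u))."""
    a = abs(u)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def standardise(samples):
    """Shift and scale samples to zero mean and unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / n)
    return [(x - mean) / sd for x in samples]

def negentropy_approx(samples, reference):
    """[E{G(y)} - E{G(v)}]^2, with both expectations estimated from samples."""
    y = standardise(samples)
    v = standardise(reference)
    e_y = sum(log_cosh(t) for t in y) / len(y)
    e_v = sum(log_cosh(t) for t in v) / len(v)
    return (e_y - e_v) ** 2

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # plays the role of v
gauss = [rng.gauss(0.0, 1.0) for _ in range(20000)]       # Gaussian signal
laplace = [rng.expovariate(1.0) * rng.choice((-1.0, 1.0))
           for _ in range(20000)]                         # heavy-tailed signal
j_gauss = negentropy_approx(gauss, reference)
j_laplace = negentropy_approx(laplace, reference)
```

The Gaussian signal yields a value close to zero, while the Laplacian signal yields a clearly larger value, which is why maximising this quantity favours non-Gaussian projections.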



FlexICA 



Mutual information is a natural measure of information that members of a set of 
random variables have on the others. Choi et al. (2000) proposed an ICA algorithm 
that attempts to minimise this quantity. All independent components are estimated 
simultaneously using a natural gradient learning rule with the assumption that the 
source signals have the generalized Gaussian distribution with density 



\[
p_i(y_i) = \frac{r_i}{2\sigma_i\,\Gamma(1/r_i)} \exp\left( \dots \right)
\]

Here r_i denotes the Gaussian exponent, which is chosen in a flexible way depending
on the kurtosis of the y_i.



2.3 Two-Stage K-Means clustering 

From a statistical point of view it may be inappropriate to use K-means clustering
since K-means cannot use all the higher-order information that ICA provides. There
are several approaches that avoid using K-means; for example, Shah et al. (2004)
proposed the ICA mixture model (ICAMM). However, for large images this algorithm
fails to converge. We developed a 2-stage K-means clustering strategy that works
particularly well with skin data. The choice of 5 and 3 clusters, respectively, for the
K-means algorithm has been determined empirically for the skin cancer data set.

1. Drop ICs that contain a high amount of noise or correspond to artifacts. 

2. Perform K-means clustering with 5 clusters. 

3. Those clusters that correspond to healthy skin are taken together into one cluster. 
This cluster is labelled as skin. 

4. Perform a second run of K-means clustering on the remaining clusters (inflamed 
skin, lesion, etc.). This time use 3 clusters. Label the clusters that correspond to 
the mole and melanoma centre as mole and melanoma. The remaining clusters 
are considered to be ‘regions of uncertainty’. 
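
The four steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: `kmeans` is a plain Lloyd iteration with random initialisation, the pixels are assumed to be tuples of independent-component scores with noisy ICs already dropped (step 1), and the user's decision about which first-stage clusters are healthy skin is represented by a hypothetical callback `is_skin_cluster`. Naming the second-stage clusters as mole, melanoma or region of uncertainty remains a manual step and is left out.

```python
import random

def _nearest(point, centres):
    """Index of the centre closest to `point` (squared Euclidean distance)."""
    return min(range(len(centres)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centres[c])))

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    k = min(k, len(points))
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [_nearest(p, centres) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster ran empty
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def two_stage_kmeans(pixels, is_skin_cluster, k1=5, k2=3, seed=0):
    """Stage 1: k1 clusters, skin clusters merged. Stage 2: k2 clusters on the rest."""
    first = kmeans(pixels, k1, seed=seed)
    skin_ids = set()
    for c in set(first):
        members = [p for p, lab in zip(pixels, first) if lab == c]
        if is_skin_cluster(members):
            skin_ids.add(c)
    rest = [i for i, lab in enumerate(first) if lab not in skin_ids]
    second = kmeans([pixels[i] for i in rest], k2, seed=seed)
    labels = ["skin"] * len(pixels)
    for i, lab in zip(rest, second):
        labels[i] = "stage2-%d" % lab
    return labels

# Toy data: 100 'skin' pixels near the origin and three small lesion groups.
skin = [(0.01 * i, 0.0) for i in range(100)]
lesion = [(cx + 0.01 * i, 1.0) for cx in (4.0, 6.0, 8.0) for i in range(30)]
pixels = skin + lesion
labels = two_stage_kmeans(
    pixels, is_skin_cluster=lambda members: all(x < 2.0 for x, _ in members))
```

In this toy setting a cluster counts as skin only if all of its members lie in the low-score region, so no lesion pixel can end up in the merged skin class.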

3 Classification

This section describes the classification methods that have been investigated. The 
preprocessing steps for the training data are the same as in the segmentation task: 
Dimension reduction using PCA and feature extraction performed by ICA. Using 
the Bayesian Information Criterion (BIC), the data were reduced to 6 dimensions. 

3.1 Mixture Discriminant Analysis (MDA) 

MDA assumes that each class j can be modelled as a mixture of R_j subclasses.
The subclasses have a multivariate Gaussian distribution with mean vector μ_{jr},
r = 1, ..., R_j, and covariance matrix Σ, which is the same for all classes. Hence, the
mixture model for class j has the density

\[
m_j(x) = |2\pi\Sigma|^{-1/2} \sum_{r=1}^{R_j} \pi_{jr} \exp\left\{ -\tfrac{1}{2} (x - \mu_{jr})^T \Sigma^{-1} (x - \mu_{jr}) \right\},
\]

where π_{jr} denote the mixing probabilities for the j-th class, with \sum_{r=1}^{R_j} π_{jr} = 1. The
parameters θ = {μ_{jr}, Σ, π_{jr}} can be estimated using an EM algorithm or, as Hastie et
al. (2001) suggest, using optimal scoring. It is also possible to use flexible
discriminant analysis (FDA) or penalized discriminant analysis (PDA) in combination with
MDA. The major drawback of this classification approach is that, similar to LDA,
which is also described in Hastie et al. (2001), the covariance matrix is fixed for all
classes and the number of subclasses for each class has to be set in advance.

3.2 Model-based Discriminant Analysis (MclustDA) 

MclustDA, proposed by Fraley and Raftery (2002), extends MDA in such a way that the
covariance in each class is parameterized using the eigenvalue decomposition

\[
\Sigma_r = \lambda_r D_r A_r D_r^T, \quad r = 1, \dots, R_j.
\]

The volume of the component is controlled by λ_r, A_r defines the shape and D_r is
responsible for the orientation. The model selection is done using the BIC and the
maximum likelihood estimation is performed by an EM algorithm.

3.3 Support Vector Machines (SVM) 

The aim of support vector machines is to find a hyperplane that optimally separates
two classes in a high-dimensional feature space induced by a Mercer kernel K(x, z).
In the L2-norm case the Lagrangian dual problem is to find λ* that solves the
following convex optimization problem:

\[
\max_{\lambda} \; \sum_{i=1}^{m} \lambda_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \lambda_i \lambda_j y_i y_j \left[ K(x_i, x_j) + \frac{1}{C}\,\delta_{ij} \right] \quad \text{s.t.} \quad \sum_{i=1}^{m} \lambda_i y_i = 0, \; \lambda_i \ge 0,
\]

where x_i are training points belonging to classes y_i. The cost parameter C and the
kernel function have to be chosen to suit the problem. It is also possible to use
different cost parameters for unbalanced data, as was suggested by Veropoulos et al.
(1999).

Although SVMs were originally designed as binary classifiers, there exists a
variety of methods to extend them to k > 2 classes. In our work we focused on
one-against-all and one-against-one SVMs. The one-against-all formulation trains each
class against all remaining classes, resulting in k binary SVMs. The one-against-one
formulation uses k(k-1)/2 SVMs, each separating one class from another.
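
The difference between the two multi-class schemes is easiest to see by enumerating the binary problems they generate. The sketch below only lists the training tasks and does not train any SVM; the function names are illustrative.

```python
from itertools import combinations

def one_against_all_tasks(classes):
    """One binary problem per class: that class versus all remaining classes."""
    return [(c, [other for other in classes if other != c]) for c in classes]

def one_against_one_tasks(classes):
    """One binary problem per unordered pair of classes: k(k-1)/2 in total."""
    return list(combinations(classes, 2))

classes = ["melanoma", "mole", "skin"]
oaa = one_against_all_tasks(classes)
oao = one_against_one_tasks(classes)
```

For the three classes used in this study both schemes happen to need three binary SVMs; the counts diverge for larger k, e.g. 14 versus 91 binary problems for k = 14.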



4 Results 

A set of 310 hyper-spectral images (512 x 512 pixels and 300 spectral bands) of
malignant and benign lesions was taken in clinical studies at the Medical University
Graz, Austria. They are classified as melanoma or mole by human experts on the
basis of a histological examination. However, in our study we distinguish between
three classes, melanoma, mole and skin, since all these classes typically occur in the
images. The segmentation task is especially difficult in this application: we have
to take into account that melanoma typically occurs in combination with mole. To
reduce the number of outliers in the training set we define a ‘region of uncertainty’
as a transition region between the kernels of mole and melanoma and between the
lesion and the skin.

4.1 Training 

Figures 1(b) and 1(c) display the first step of the K-Means strategy described in Sec- 
tion 2.3. The original image displayed in Figure 1(a) shows a mole that is located 
in the middle of a hand. For PCA-transformed data, as in Figure 1(b), the algorithm 
performs poorly and the classes do not correspond to lesion, mole and skin regions 
(left and bottom). Even the lesion is in the same class together with an illumination 
problem. If the data is also transformed using ICA, as in Figure 1(c), the lesion is 
already identified and there exists a second class in the form of a ring around the 
lesion which is the desired ‘region of uncertainty’. The other classes correspond to 
wrinkles on the hand. 

Figure 1(d) shows the second K-Means step for the PCA-transformed data. Although
the second K-Means step makes it possible to separate the lesion from the
illumination problem, it can be seen that the class that should correspond to the kernel of the
mole is too large. Instances from other classes are present in the kernel. The second
K-Means step with the ICA-preprocessed data is shown in Figure 1(e). Not only is the
kernel reliably detected, but there also exists a transition region consisting of two
classes. One class contains the border of the lesion. The second class separates the
kernel from the remaining part of the mole.

Fig. 1. The two iteration steps of the K-Means approach for both PCA ((b) and (d)) and
ICA ((c) and (e)) are displayed together with the original image (a). The different gray levels
indicate the cluster the pixel has been assigned to.

We believe that the FastICA algorithm is the most appropriate ICA implementation
for this segmentation task. The segmentation quality for both methods is very
similar; however, the FastICA algorithm is faster and more stable.

To generate a training set of 12,000 pixel spectra per class we labelled 60 mole
images and 17 melanoma images using our labelling approach. The pixels in the
training set are chosen randomly from the segmented images.

4.2 Classification 

In Table 1 we present the classification results obtained for the different classifiers
described in Section 3. As a test set we use 57 melanoma and 253 mole images. We
use the output of the LDA classifier as a benchmark.

LDA turns out to be the worst classifier for the recognition of moles. Nearly one half
of the mole images are misclassified as melanoma. On the other hand, LDA yields
excellent results for the classification of melanoma, giving rise to the presumption
that there is a large bias towards the melanoma class. With MDA we use three
subclasses in each class. Although both MDA and LDA keep the covariance fixed, MDA
models the data as a mixture of Gaussians, leading to a significantly higher recognition
rate compared to LDA. Using FDA or PDA in combination with MDA does not
improve the results. MclustDA performs best among these classifiers. Note, however,
that BIC overestimates the number of subclasses in each class, which is between 14
and 21. For all classes the model with varying shape, varying volume and varying
orientation of the mixture components is chosen. This extra flexibility makes it
possible to outperform MDA even though only half of the training points could be used
due to memory limitations. Another significant advantage of MclustDA is its speed,
taking around 20 seconds for a full image.

Table 1. Recognition rates obtained for the different classifiers

Pre-Proc.  Class      MDA      MclustDA  LDA
FlexICA    Mole       84.5%    86.5%     56.1%
           Melanoma   89.4%    89.4%     98.2%
FastICA    Mole       84.5%    87.7%     56.1%
           Melanoma   89.4%    89.4%     98.2%

Pre-Proc.  Class      OAA-SVM  OAO-SVM   unbalanced SVM
FlexICA    Mole       72.7%    69.9%     87.7%
           Melanoma   92.9%    94.7%     89.4%
FastICA    Mole       71.5%    69.9%     87.3%
           Melanoma   92.9%    94.7%     89.4%

Since misclassification of melanoma into the mole class is less favourable than
misclassification of mole into the melanoma class, we clearly have unbalanced data
in the skin cancer problem. Following Veropoulos et al. (1999) we can choose
C_melanoma > C_mole = C_skin. We obtain the best results using the polynomial kernel of
degree three with C_melanoma = 0.5 and C_mole = C_skin = 0.1. This method is clearly
superior when compared with the other SVM approaches. For the one-against-all
(OAA-SVM) and the one-against-one (OAO-SVM) formulation we use Gaussian
kernels with C = 2 and σ = 20. A drawback of all the SVM classifiers, however, is
that training takes 20 hours (Centrino Duo 2.17 GHz, 2 GB RAM) and classification
of a full image takes more than 2 minutes.

We discovered that different ICA implementations have no significant impact on the
quality of the classification output. FlexICA performs slightly better for the
unbalanced SVM and one-against-all SVM. FastICA gives better results for MclustDA.
For all other classifiers the performances are equal.



5 Conclusion 

The combination of PCA and ICA makes it possible to detect both artifacts and the 
lesion in hyper-spectral skin cancer data. The algorithm projects the correspond- 
ing features on different independent components; dropping the independent com- 
ponents that correspond to the artifacts and applying a 2-stage K-Means clustering 
leads to a reliable segmentation of the images. It is interesting to note that for the 
mole images in our study there is always one single independent component that 
carries the information about the whole lesion. This suggests very simple segmen- 
tation in the case where the skin is healthy: keep the single independent component 
that contains the desired information and perform the K-Means steps. For melanoma 








Hannes Kazianka, Raimund Leitner and Jurgen Pilz 



images the spectral information about the lesion is contained in at least two inde- 
pendent components, leading to reliable separation of the melanoma kernel from the 
mole kernel. 

Unbalanced SVM and MclustDA yield equally good classification results; however,
because of its computational performance, MclustDA is the best overall classifier
for the skin cancer data.

The presented segmentation and classification approach does not use any spatial in- 
formation. In future research Markov random fields and contextual classifiers could 
be used to take into account the spatial context. 

In a possible application, where the physician is assisted by a system which pre-screens
patients, we have to ensure high sensitivity, which is typically accompanied by
a loss in specificity. Preliminary experiments showed that a sensitivity of 95% is
possible at the cost of 20% false positives.



References 

ABE, S. (2005): Support Vector Machines for Pattern Classification. Springer, London. 

CHOI, S., CICHOCKI, A. and AMARI, S. (2000): Flexible Independent Component Analysis. 
Journal of VLSI Signal Processing, 26(1/2), 25-38. 

FRALEY, C. and RAFTERY, A. (2002): Model-Based Clustering, Discriminant Analysis, and 
Density Estimation. Journal of the American Statistical Association, 97, 611-631. 

HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2001): The Elements of Statistical Learn- 
ing. Springer, New York. 

HYVÄRINEN, A., KARHUNEN, J. and OJA, E. (2001): Independent Component Analysis.
Wiley, New York.

SHAH, C., ARORA, M. and VARSHNEY, P. (2004): Unsupervised classification of
hyper-spectral data: an ICA mixture model based approach. International Journal of Remote
Sensing, 25, 481-487.

VEROPOULOS, K., CAMPBELL, C. and CRISTIANINI, N. (1999): Controlling the Sensitivity
of Support Vector Machines. Proceedings of the Sixteenth International Joint Conference
on Artificial Intelligence, Workshop ML3, 55-60.




FSMTree: An Efficient Algorithm for Mining 
Frequent Temporal Patterns 



Steffen Kempe¹, Jochen Hipp¹ and Rudolf Kruse²

¹ DaimlerChrysler AG, Group Research, 89081 Ulm, Germany
{Steffen.Kempe, Jochen.Hipp}@daimlerchrysler.com
² Dept. of Knowledge Processing and Language Engineering,
University of Magdeburg, 39106 Magdeburg, Germany
Kruse@iws.cs.uni-magdeburg.de

Abstract. Research in the field of knowledge discovery from temporal data recently focused 
on a new type of data: interval sequences. In contrast to event sequences interval sequences 
contain labeled events with a temporal extension. Mining frequent temporal patterns from 
interval sequences proved to be a valuable tool for generating knowledge in the automotive 
business. In this paper we propose a new algorithm for mining frequent temporal patterns from 
interval sequences: FSMTree. FSMTree uses a prefix tree data structure to efficiently organize 
all finite state machines and therefore dramatically reduces execution times. We demonstrate 
the algorithm’s performance on field data from the automotive business. 



1 Introduction 

Mining sequences from temporal data is a well known data mining task which gained 
much attention in the past (e.g. Agrawal and Srikant (1995), Mannila et al. (1997), 
or Pei et al. (2001)). In all these approaches, the temporal data is considered to con- 
sist of events. Each event has a label and a timestamp. In the following, however, 
we focus on temporal data where an event has a temporal extension. These tempo- 
rally extended events are called temporal intervals. Each temporal interval can be 
described by a triplet (b, e, l) where b and e denote the beginning and the end of the
interval and l its label.

At DaimlerChrysler we are interested in mining interval sequences in order to
further extend the knowledge about our products. Thus, in our domain one interval
sequence may describe the history of one vehicle. The configuration of a vehicle, e.g.
whether it is an estate car or a limousine, can be described by temporal intervals. The
build date is the beginning and the current day is the end of such a temporal interval.
Other temporal intervals may describe stopovers in a garage or the installation of
additional equipment. Hence, mining these interval sequences might help us in tasks
like quality monitoring or improving customer satisfaction.



2 Foundations and related work

As mentioned above we represent a temporal interval as a triplet {b, e,l). 

Definition 1. (Temporal Interval) Given a set of labels $L$ with $l \in L$, we say the triplet 
$(b,e,l) \in \mathbb{R} \times \mathbb{R} \times L$ is a temporal interval, if $b < e$. The set of all temporal 
intervals over $L$ is denoted by $I$. 

Definition 2. (Interval Sequence) Given a sequence of temporal intervals, we say 
$(b_1,e_1,l_1),(b_2,e_2,l_2),\dots,(b_n,e_n,l_n) \in I$ is an interval sequence, if 

$$\forall (b_i,e_i,l_i),(b_j,e_j,l_j) \in I,\ i \neq j:\quad b_i \le b_j \wedge e_i \ge b_j \Rightarrow l_i \neq l_j \tag{1}$$

$$\forall (b_i,e_i,l_i),(b_j,e_j,l_j) \in I,\ i < j:\quad (b_i < b_j) \vee (b_i = b_j \wedge e_i < e_j) \vee (b_i = b_j \wedge e_i = e_j \wedge l_i < l_j) \tag{2}$$

hold. A given set of interval sequences is denoted by $\mathcal{S}$. 

Equation 1 above is referred to as the maximality assumption (Höppner (2002)). 
The maximality assumption guarantees that each temporal interval A is maximal, 
in the sense that there is no other temporal interval in the sequence sharing a time 
with A and carrying the same label. Equation 2 requires that an interval sequence 
has to be ordered by the beginning (primary), end (secondary) and label (tertiary, 
lexicographically) of its temporal intervals. 
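
The two conditions of Definition 2 can be checked mechanically. The following Python sketch is illustrative only: the function name is_interval_sequence and the tuple representation (b, e, l) are assumptions made here, and Equation 1 is read as described in the text, i.e. two intervals carrying the same label must not share a time point.

```python
from typing import List, Tuple

# A temporal interval (b, e, l): begin, end, label.
Interval = Tuple[float, float, str]


def is_interval_sequence(seq: List[Interval]) -> bool:
    """Check Definition 1 plus Equations 1 and 2 for a list of intervals."""
    # Definition 1 (as printed): the begin must precede the end.
    if any(b >= e for b, e, _ in seq):
        return False
    # Equation 2: strictly ordered by begin, then end, then label.
    # Python compares tuples lexicographically, which matches this order.
    if any(seq[i] >= seq[i + 1] for i in range(len(seq) - 1)):
        return False
    # Equation 1 (maximality): two intervals that share a time point
    # must not carry the same label.
    for i, (bi, ei, li) in enumerate(seq):
        for j, (bj, _, lj) in enumerate(seq):
            if i != j and bi <= bj <= ei and li == lj:
                return False
    return True


fig2a = [(1, 4, "A"), (3, 7, "B"), (7, 10, "A")]
print(is_interval_sequence(fig2a))                       # sequence of Figure 2.a
print(is_interval_sequence([(1, 4, "A"), (3, 7, "A")]))  # same label, shared time
print(is_interval_sequence([(3, 7, "B"), (1, 4, "A")]))  # wrong order
```

In practice, a sequence violating Equation 1 would be repaired by merging the offending intervals rather than rejected; the check above only reports the violation.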

Without temporal extension there are only two possible relations. One event is 
before (or after as the inverse relation) the other or they coincide. Due to the tem- 
poral extension of temporal intervals the possible relations between two intervals 
become more complex. There are 7 possible relations (or 13 if one includes inverse 
relations). These interval relations have been described in Allen (1983) and are 
depicted in Figure 1. Each relation of Figure 1 is a temporal pattern on its own that 
consists of two temporal intervals. Patterns with more than two temporal intervals 
are straightforward. One just needs to know which interval relation exists between 
each pair of labels. Using the set of Allen’s interval relations I, a temporal pattern is 
defined by: 

Definition 3. (Temporal Pattern) A pair $P = (s,R)$, where $s : \{1,\dots,n\} \to L$ and 
$R \in I^{n \times n}$, $n \in \mathbb{N}$, is called a “temporal pattern of size n” or “n-pattern”. 



Relation A to B        Inverse Relation B to A 

A before B             B after A 
A meets B              B is-met-by A 
A overlaps B           B is-overlapped-by A 
A is-finished-by B     B finishes A 
A contains B           B during A 
A is-started-by B      B starts A 
A equals B             B equals A 

Fig. 1. Allen’s Interval Relations 



Fig. 2. a) Example of an interval sequence: (1,4,A), (3,7,B), (7,10,A) b) Example of a temporal 
pattern (e stands for equals, o for overlaps, b for before, m for meets, io for is-overlapped-by, 
etc.) 



Figure 2.a shows an example of an interval sequence. The corresponding temporal 
pattern is given in Figure 2.b. 

Note that a temporal pattern is not necessarily valid in the sense that it must be 
possible to construct an interval sequence for which the pattern holds true. On the 
other hand, if a temporal pattern holds true for an interval sequence we consider this 
sequence as an instance of the pattern. 

Definition 4. (Instance) An interval sequence $S = (b_i,e_i,l_i)_{1 \le i \le n}$ conforms to an 
n-pattern $P = (s,R)$, if $\forall i,j : s(i) = l_i \wedge s(j) = l_j \wedge R[i,j] = ir([b_i,e_i],[b_j,e_j])$ with 
function $ir$ returning the relation between two given intervals. We say that the interval 
sequence $S$ is an instance of temporal pattern $P$. We say that an interval sequence $S'$ 
contains an instance of $P$ if $S \subseteq S'$, i.e. $S$ is a subsequence of $S'$. 
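
As an illustration of Definition 4, the function ir and the conformance test can be sketched in Python as follows. The names ir, pattern_of and conforms, the use of full relation names as strings, and the list-of-lists encoding of R are assumptions made for this sketch; intervals are assumed to satisfy b < e.

```python
from typing import List, Tuple

Interval = Tuple[float, float, str]  # (begin, end, label)


def ir(x: Tuple[float, float], y: Tuple[float, float]) -> str:
    """Allen relation of interval x to interval y (both with begin < end)."""
    (b1, e1), (b2, e2) = x, y
    if e1 < b2:
        return "before"
    if e1 == b2:
        return "meets"
    if e2 < b1:
        return "after"
    if e2 == b1:
        return "is-met-by"
    # From here on the two intervals share more than a single point.
    if b1 == b2:
        if e1 == e2:
            return "equals"
        return "starts" if e1 < e2 else "is-started-by"
    if e1 == e2:
        return "is-finished-by" if b1 < b2 else "finishes"
    if b1 < b2:
        return "contains" if e1 > e2 else "overlaps"
    return "during" if e1 < e2 else "is-overlapped-by"


def pattern_of(seq: List[Interval]) -> Tuple[List[str], List[List[str]]]:
    """Return the pattern (s, R) for which seq is an instance."""
    labels = [l for _, _, l in seq]
    R = [[ir((bi, ei), (bj, ej)) for bj, ej, _ in seq] for bi, ei, _ in seq]
    return labels, R


def conforms(seq: List[Interval], labels: List[str], R: List[List[str]]) -> bool:
    """True if seq is an instance of the pattern (labels, R)."""
    return pattern_of(seq) == (labels, R)


labels, R = pattern_of([(1, 4, "A"), (3, 7, "B"), (7, 10, "A")])
for label, row in zip(labels, R):
    print(label, row)
```

Applied to the interval sequence of Figure 2.a, the first row of R reads equals, overlaps, before, which agrees with the abbreviations e, o, b listed in the caption of Figure 2.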

Obviously a temporal pattern can only be valid if its labels have the same order as 
their corresponding temporal intervals have in an instance of the pattern. Next, we 
define the support of a temporal pattern. 

Definition 5. (Minimal Occurrence) For a given interval sequence S a time interval 
(time window) [b,e] is called a minimal occurrence of the k-pattern P (k ≥ 2), if (1.) 
the time interval [b,e] of S contains an instance of P, and (2.) there is no proper 
subinterval [b',e'] of [b,e] which also contains an instance of P. For a given interval 
sequence S a time interval [b,e] is called a minimal occurrence of the 1-pattern P, if 
(1.) the temporal interval (b,e,l) is contained in S, and (2.) l is the label in P. 

Definition 6. (Support) The support of a temporal pattern $P$ for a given set of interval 
sequences $\mathcal{S}$ is given by the number of minimal occurrences of $P$ in $\mathcal{S}$: 
$Sup_{\mathcal{S}}(P) = |\{[b,e] : [b,e] \text{ is a minimal occurrence of } P \text{ in } S \wedge S \in \mathcal{S}\}|$. 

As an illustration consider the pattern A before A in the example of Figure 2.a. The 
time window [1,11] is not a minimal occurrence as the pattern is also visible e.g. in 
its subwindow [2,9]. Also the time window [5, 8] is not a minimal occurrence. It does 
not contain an instance of the pattern. The only minimal occurrence is [4,7] as the 
end of the first and the beginning of the second A are just inside the time window. 
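
The pruning implied by Definition 5 and the counting of Definition 6 can be sketched in Python as follows, assuming that for each interval sequence a finite list of time windows known to contain an instance has already been collected. The names minimal_windows and support are hypothetical and not taken from the authors' implementation.

```python
from typing import Iterable, List, Tuple

Window = Tuple[float, float]  # time window [b, e]


def minimal_windows(windows: Iterable[Window]) -> List[Window]:
    """Keep only windows that have no proper subwindow among the candidates."""
    ws = set(windows)

    def properly_contains(outer: Window, inner: Window) -> bool:
        return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]

    return sorted(w for w in ws if not any(properly_contains(w, v) for v in ws))


def support(windows_per_sequence: Iterable[Iterable[Window]]) -> int:
    """Sum the number of minimal occurrences over all interval sequences."""
    return sum(len(minimal_windows(ws)) for ws in windows_per_sequence)


# Windows from the illustration above for the pattern "A before A".
print(minimal_windows([(1, 11), (2, 9), (4, 7)]))  # [(4, 7)]
```

Duplicate windows are counted once per sequence here, which follows the set notation of Definition 6.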

The mining task is to find all temporal patterns in a set of interval sequences 
which satisfy a defined minimum support threshold. Note that this task is closely 
related to frequent itemset mining, e.g. Agrawal et al. (1993). 

Previous investigations on discovering frequent patterns from sequences of temporal 
intervals include the work of Höppner (2002), Kam and Fu (2000), Papapetrou 
et al. (2005), and Winarko and Roddick (2005). These approaches can be divided 
into two groups. The main difference between the two groups is the definition 
of support. Höppner defines the temporal support of a pattern. It can be interpreted 
as the probability of seeing an instance of the pattern within the time window if the time 
window is randomly placed on the interval sequence. All other approaches count the 
number of instances for each pattern. The pattern counter is incremented once for 
each sequence that contains the pattern. If an interval sequence contains multiple 
instances of a pattern then these additional instances will not further increment the 
counter. 

For our application neither of the support definitions turned out to be satisfactory. 
Höppner’s temporal support of a pattern is hard to interpret in our domain, as it 
is generally not related to the number of instances of this pattern in the data. Also, 
neglecting multiple instances of a pattern within one interval sequence is inapplicable 
when mining the repair history of vehicles. Therefore we extended the approach 
of minimal occurrences in Mannila et al. (1997) to the demands of temporal intervals. 
In contrast to previous approaches, our support definition allows us (1.) to count the 
number of pattern instances, (2.) to handle multiple instances of a pattern within one 
interval sequence, and (3.) to apply time constraints on a pattern instance. 



3 Algorithms FSMSet and FSMTree 

In Kempe and Hipp (2006) we presented FSMSet, an algorithm to find all frequent 
patterns within a set of interval sequences S. The main idea is to generate all frequent 
temporal patterns by applying the Apriori scheme of candidate generation and sup- 
port evaluation. Therefore FSMSet consists of two steps: generation of candidate sets 
and support evaluation of these candidates. These two steps are alternately repeated 
until no more candidates are generated. The Apriori scheme starts with the frequent 
1-patterns and then successively derives all k-candidates from the set of all frequent 
(k-1)-patterns. 

In this paper we will focus on the support evaluation of the candidate patterns, as 
it is the most time consuming part of the algorithm. FSMSet uses finite state machines 
which successively take the temporal intervals of an interval sequence as input to 
find all instances of a candidate pattern. 

It is straightforward to derive a finite state machine from a temporal pattern. 
For each label in the temporal pattern a state is generated. The finite state machine 
starts in an initial state. The next state is reached if we input a temporal interval that 
contains the same label as the first label of the temporal pattern. From now on the 
next states can only be reached if the shown temporal interval carries the same label 
as the state and its interval relation to all previously accepted temporal intervals is 
the same as specified in the temporal pattern. If the finite state machine reaches its 
last state it also reaches its final accepting state. Consequently the temporal intervals 
that have been accepted by the state machine are an instance of the temporal pattern. 

The minimal time window in which this pattern instance is visible can be derived 
from the temporal intervals which have been accepted by the state machine. We 
know that the time window contains an instance but we do not know whether it is 
a minimal occurrence. Therefore FSMSet applies a two step approach. First it will 
find all instances of a pattern using state machines. Then it prunes all time windows 
which are not minimal occurrences. 

Fig. 3. a) - d) four candidate patterns of size 3 e) an interval sequence 

Table 1. Set of state machines of FSMSet for the example of Figure 3. Each column shows the 
new state machines that have been added by FSMSet. 

           1          2          3           4           5             6 
S_a()      S_a(1)     S_a(2)     S_c(3)      S_c(3,4)    S_a(5)        S_a(1,3,6) 
S_b()      S_b(1)     S_b(2)     S_d(3)      S_d(3,4)    S_b(5)        S_b(2,3,6) 
S_c()                            S_a(1,3)                S_c(3,4,5) 
S_d()                            S_b(2,3) 

To find all instances of a pattern in an interval sequence FSMSet maintains 
a set of finite state machines. At first, the set only contains the state machine that 
is derived from the candidate pattern. Subsequently, each temporal interval from the 
interval sequence is shown to every state machine in the set. If a state machine can 
accept the temporal interval, a copy of the state machine is added to the set. The 
temporal interval is shown only to one of these two state machines. Hence, there will 
always be a copy of the initial state machine in the set trying to find a new instance 
of the pattern. In this way FSMSet can also handle situations in which a single state 
machine does not suffice. Consider the pattern A meets B and the interval sequence 
(1,2,A), (3,4,A), (4,5,B). Without using look ahead a single finite state machine 
would accept the first temporal interval (1,2,A). This state machine is stuck as it 
cannot reach its final state because there is no temporal interval which is-met-by 
(1,2,A). Hence the pattern instance (3,4,A), (4,5,B) could not be found by a single 
state machine. Here this is not a problem because there is a copy of the first state 
machine which will find the pattern instance. 
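
The copy-on-accept scheme described above can be sketched in Python as follows. This is a simplified illustration, not the authors' Java implementation: a state machine is represented only by the tuple of indices of the intervals it has accepted so far, and the names ir and find_instances as well as the string encoding of the relations are assumptions of this sketch.

```python
from typing import List, Tuple

Interval = Tuple[float, float, str]  # (begin, end, label)


def ir(x, y) -> str:
    """Allen relation of interval x = (b1, e1) to y = (b2, e2), begin < end."""
    (b1, e1), (b2, e2) = x, y
    if e1 < b2:
        return "before"
    if e1 == b2:
        return "meets"
    if e2 < b1:
        return "after"
    if e2 == b1:
        return "is-met-by"
    if b1 == b2:
        if e1 == e2:
            return "equals"
        return "starts" if e1 < e2 else "is-started-by"
    if e1 == e2:
        return "is-finished-by" if b1 < b2 else "finishes"
    if b1 < b2:
        return "contains" if e1 > e2 else "overlaps"
    return "during" if e1 < e2 else "is-overlapped-by"


def find_instances(labels: List[str], R: List[List[str]], seq: List[Interval]):
    """Return all instances of the pattern (labels, R) and the machine count."""
    n = len(labels)
    machines: List[Tuple[int, ...]] = [()]  # the initial state machine
    instances: List[Tuple[int, ...]] = []
    for k, (b, e, label) in enumerate(seq):
        # Iterate over a snapshot: a copy created for interval k must not
        # be shown interval k again.
        for accepted in list(machines):
            pos = len(accepted)
            if pos == n or labels[pos] != label:
                continue
            # The new interval must stand in the required relation to every
            # interval this machine has accepted before.
            if all(ir(seq[i][:2], (b, e)) == R[j][pos]
                   for j, i in enumerate(accepted)):
                copy = accepted + (k,)
                machines.append(copy)  # the original machine stays unchanged
                if len(copy) == n:
                    instances.append(copy)
    return instances, len(machines)


# Pattern "A meets B" on the interval sequence discussed above.
labels = ["A", "B"]
R = [["equals", "meets"], ["is-met-by", "equals"]]
seq = [(1, 2, "A"), (3, 4, "A"), (4, 5, "B")]
print(find_instances(labels, R, seq))
```

For this input the sketch reports the single instance formed by the second and third interval (indices 1 and 2, counted from zero), while the machine that accepted (1,2,A) remains stuck, as described in the text.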

Figure 3 and Table 1 give an example of FSMSet’s support evaluation. There are 
four candidate patterns (Figure 3.a - 3.d) for which the support has to be evaluated 
on the given interval sequence in Figure 3.e. 

At first, a state machine is derived for each candidate pattern. The first column 
in Table 1 corresponds to this initialization (state machines S_a - S_d). Afterwards 
each temporal interval of the sequence is used as input for the state machines. The 
first temporal interval has label A and can only be accepted by the state machines 
S_a() and S_b(). Thus the new state machines S_a(1) and S_b(1) are added. The numbers 
in brackets refer to the temporal intervals of the interval sequence that have been 
accepted by the state machine. The second temporal interval again carries the label 
A and can only be accepted by S_a() and S_b(). The third temporal interval has label B 
and can be accepted by S_c() and S_d(). It also stands to the first A in the relation after 
and to the second A in the relation is-overlapped-by. Hence the state machines 
S_a(1) and S_b(2) can also accept this interval. Table 1 shows all new state machines for 
each temporal interval of the interval sequence. For this example the approach of 
FSMSet needs 19 state machines to find all three instances of the candidate patterns. 

A closer examination of the state machines in Table 1 reveals that many state 
machines show a similar behavior. For example, both state machines S_c and S_d accept 
exactly the same temporal intervals until the fourth iteration of FSMSet. Only the fifth 
temporal interval cannot be accepted by S_d. The reason is that both state machines 
share the common subpattern B overlaps C as their first part (i.e. a common prefix 
pattern). Only after this prefix pattern is processed can their behavior differ. Thus we 
can minimize the algorithmic costs of FSMSet by combining all state machines that 
share a common prefix. Combining all state machines of Figure 3 in a single data 
structure leads to the prefix tree in Figure 4. Each path of the tree is a state machine. 
But now different state machines can share states, if their candidate patterns share a 
common pattern prefix. By using the new data structure we derive a new algorithm 
for the support evaluation of candidate patterns: FSMTree. 

Instead of maintaining a set of state machines FSMTree maintains a set of nodes 
from the prefix tree. In the first step the set only contains the root node of the tree. 
Afterwards all temporal intervals of the interval sequence are processed one after another. 
Each time a node of the set can accept the current temporal interval its corresponding 
child node is added to the set. Table 2 shows the new nodes that are added in each 
step if we apply the prefix tree of Figure 4 to the example of Figure 3. Obviously the 
algorithmic overhead is reduced significantly. Instead of 19 state machines FSMTree 
only needs 11 nodes to find all pattern instances. 
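
The prefix tree evaluation can be sketched in Python as follows. The sketch is an illustration under assumptions made here, not the authors' implementation: tree nodes are plain dictionaries, each edge is keyed by the next label together with its relations to all earlier pattern positions, and the two example patterns P1 and P2 are invented for the demonstration (they are not the candidates of Figure 3).

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float, str]  # (begin, end, label)


def ir(x, y) -> str:
    """Allen relation of interval x = (b1, e1) to y = (b2, e2), begin < end."""
    (b1, e1), (b2, e2) = x, y
    if e1 < b2:
        return "before"
    if e1 == b2:
        return "meets"
    if e2 < b1:
        return "after"
    if e2 == b1:
        return "is-met-by"
    if b1 == b2:
        if e1 == e2:
            return "equals"
        return "starts" if e1 < e2 else "is-started-by"
    if e1 == e2:
        return "is-finished-by" if b1 < b2 else "finishes"
    if b1 < b2:
        return "contains" if e1 > e2 else "overlaps"
    return "during" if e1 < e2 else "is-overlapped-by"


def build_tree(patterns: Dict[str, Tuple[List[str], List[List[str]]]]) -> dict:
    """Merge all candidate patterns into one prefix tree."""
    root = {"children": {}, "patterns": []}
    for name, (labels, R) in patterns.items():
        node = root
        for k, label in enumerate(labels):
            # Edge key: label k and its relations to positions 0..k-1.
            key = (label, tuple(R[i][k] for i in range(k)))
            node = node["children"].setdefault(
                key, {"children": {}, "patterns": []})
        node["patterns"].append(name)  # a complete pattern ends here
    return root


def find_instances(root: dict, seq: List[Interval]):
    """Return found (pattern, interval indices) pairs and the node count."""
    active = [(root, ())]  # (tree node, indices of accepted intervals)
    found = []
    for k, (b, e, label) in enumerate(seq):
        for node, accepted in list(active):  # snapshot, as in FSMSet
            key = (label, tuple(ir(seq[i][:2], (b, e)) for i in accepted))
            child = node["children"].get(key)
            if child is not None:
                active.append((child, accepted + (k,)))
                found.extend((name, accepted + (k,))
                             for name in child["patterns"])
    return found, len(active)


# Two hypothetical 3-patterns sharing the prefix "A before B".
R = [["equals", "before", "before"],
     ["after", "equals", "overlaps"],
     ["after", "is-overlapped-by", "equals"]]
tree = build_tree({"P1": (["A", "B", "C"], R), "P2": (["A", "B", "D"], R)})
seq = [(1, 2, "A"), (3, 5, "B"), (4, 7, "C"), (4, 8, "D")]
print(find_instances(tree, seq))
```

In this small example five tree nodes are kept, whereas evaluating P1 and P2 with separate state machines would create eight machines, because the shared prefix A before B is then tracked twice.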




Fig. 4. FSMTree: prefix tree of state machines based on the candidates of Figure 3 

Table 2. Set of nodes of FSMTree for the example of Figure 3. Each column gives the new 
nodes that have been added by FSMTree. 

Fig. 5. Runtimes of FSMSet and FSMTree for different support thresholds. 



4 Performance evaluation and conclusions 

In order to evaluate the performance of FSMTree in a real application scenario we 
employed a dataset from our domain. This dataset contains information about the 
history of 101250 vehicles. There is one sequence for each vehicle. Each sequence 
comprises between 14 and 48 temporal intervals. In total, there are 345 different 
labels and about 1.4 million temporal intervals in the dataset. 

We performed 5 different experiments varying the minimum support threshold 
from 3200 down to 200. For each experiment we measured the runtimes of FSMSet 
and FSMTree. The algorithms are implemented in Java and all experiments were 
carried out on a SUN Fire X2100 running at 2.2 GHz. 

Figure 5 shows that FSMTree clearly outperforms FSMSet. In the first experiment 
FSMTree reduced the runtime from 36 to 5 minutes. The difference between FSMSet 
and FSMTree grows even larger as the minimum support threshold gets lower. For the last 
experiment FSMSet needed two days while it took FSMTree only 81 minutes. The 
reason for FSMTree’s huge runtime advantage at low support thresholds is that as the 
support threshold decreases the number of frequent patterns increases. Consequently 
the number of candidate patterns increases too. The number of candidates is the 
same for FSMSet and FSMTree but FSMTree combines all patterns with common 
prefix patterns. If there are more candidate patterns the chance of common prefixes 
increases. Therefore FSMTree’s ability to reduce the runtime will increase (compared 
to FSMSet) as the support threshold gets lower. 

In this paper we presented FSMTree: a new algorithm for mining frequent temporal 
patterns from interval sequences. FSMTree is based on the Apriori approach of 
candidate generation and support evaluation. For each candidate pattern a finite state 
machine is derived to parse the input data for instances of this pattern. FSMTree uses 
a prefix-tree-like data structure to efficiently organize all finite state machines. In our 
application of mining the repair history of vehicles FSMTree was able to dramatically 
reduce execution times. 



Abstract. We present MIRtoolbox, an integrated set of functions written in Matlab, dedicated 
to the extraction from audio files of musical features related, among others, to timbre, tonality, 
rhythm or form. The objective is to offer an overview of the state of the art in computational 
approaches in the area of Music Information Retrieval (MIR). The design is based on a modular 
framework: the different algorithms are decomposed into stages, formalized using a minimal set 
of elementary mechanisms, and integrating different variants proposed by alternative approaches 
(including new strategies we have developed) that users can select and parametrize. These 
functions can adapt to a wide range of input objects. 

This paper offers an overview of the set of features that can be extracted with MIRtoolbox, 
illustrated with the description of three particular musical features. The toolbox also includes 
functions for statistical analysis, segmentation and clustering. 

One of our main motivations for the development of the toolbox is to facilitate investiga- 
tion of the relation between musical features and music-induced emotion. Preliminary results 
show that the variance in emotion ratings can be explained by a small set of acoustic features. 



1 Motivation and approach 

MIRtoolbox is a Matlab toolbox dedicated to the extraction of musically-related 
features from audio recordings. It has been designed in particular with the objective of 
enabling the computation of a large range of features from databases of audio files, 
which can then be used in statistical analyses. 

We chose to base the design of the toolbox on the Matlab computing environment, 
as it offers good visualisation capabilities and gives access to a large variety of other 
toolboxes. In particular, the MIRtoolbox makes use of functions available in 
public-domain toolboxes such as the Auditory Toolbox (Slaney, 1998), NetLab (Nabney, 
2002), or SOMtoolbox (Vesanto, 1999). It appeared that such a computational 
framework, because of its general objectives, could be useful to the research community in 
Music Information Retrieval (MIR), but also for teaching. For that reason, particular 
attention has been paid to the ease of use of the toolbox. The functions 
are called using a simple and adaptive syntax. More expert users can specify a large 
range of options and parameters. 

The different musical features extracted from the audio files are highly 
interdependent: in particular, as can be seen in figure 1, some features are based on the same 
initial computations. In order to improve the computational efficiency, it is important 
to avoid redundant computations of these common components. Each of these 
intermediary components, and the final musical features, are therefore considered as 
building blocks that can be freely combined with one another. Besides, in keeping 
with the objective of optimal ease of use of the toolbox, each building block has been 
conceived in such a way that it can adapt to the type of input data. 



2 Feature extraction 

Figure 1 shows an overview of the main features considered in the toolbox. All the 
different processes start from the audio signal (on the left) and form a chain of 
operations developed horizontally from left to right. The vertical disposition of the processes 
indicates an increasing order of complexity of the operations, from the simplest 
computation (top) to more detailed auditory modelling (bottom). Each musical feature 
is related to the different broad musical dimensions traditionally defined in music 
theory. In bold are highlighted features related to pitch, to tonality (chromagram, 
key strength and key Self-Organising Map, or SOM) and to dynamics (Root Mean 
Square, or RMS, energy). In bold italics are indicated features related to rhythm: 
namely tempo, pulse clarity and fluctuation. In simple italics are highlighted a large 
set of features that can be associated with timbre. Among them, all the operators in grey 
italics can in fact be applied to many other representations: for instance, 
statistical moments such as centroid, kurtosis, etc., can be applied to spectra and 
envelopes, but also to any histogram based on any given feature. 



[Figure 1: block diagram. Nodes, starting from the audio signal waveform: zero-crossing rate; 
RMS energy; envelope, attack/sustain/release; spectrum; centroid, kurtosis, spread, skewness; 
flatness, roll-off, entropy, irregularity; brightness, roughness; spectral flux; Mel-scale 
spectrum; fluctuation; chromagram → key strength → key SOM; cepstrum; autocorrelation; pitch; 
filterbank → envelope; tempo; pulse clarity.] 

Fig. 1. Overview of the musical features that can be extracted with MIRtoolbox. 



2.1 Example: Timbre analysis 

One common way of describing timbre is based on Mel-frequency cepstral coefficients, 
or MFCCs (Rabiner and Juang, 1993; Slaney, 1998). MFCCs, which provide a measure of 
spectral shape, have been found to be a good predictor of timbral similarity. Figure 2 
shows the diagram of operations. First, the audio sequence is described in the spectral 
domain, using an FFT. The spectrum is converted from the frequency domain to the 
Mel-scale domain: the frequencies are rearranged into 40 frequency bands called Mel-bands. 
The envelope of the Mel-scale spectrum is described through a Discrete Cosine Transform. The 
values obtained through this transform are the MFCCs. Usually only a restricted 
number of them (for instance the first 13) are selected. The computation can be 
carried out in a window sliding through the audio signal, resulting in a series of MFCC 
vectors, one for each successive frame, that can be represented column-wise in a 
matrix. Figure 2 shows an example of such a matrix. The MFCCs are generally applied 
to distance computation between frames, and therefore to segmentation tasks. 
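
The last two steps of this chain can be illustrated with a short Python sketch, which is not MIRtoolbox code. It assumes that the 40 Mel-band energies of one frame are already available; the function names dct_ii and mfcc_from_mel are hypothetical. The logarithm taken before the transform is the conventional step in MFCC computation and is an assumption here, since the description above does not mention it.

```python
import math
from typing import List


def dct_ii(x: List[float]) -> List[float]:
    """Unnormalised type-II Discrete Cosine Transform of x."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def mfcc_from_mel(mel_energies: List[float], n_coeffs: int = 13,
                  eps: float = 1e-10) -> List[float]:
    """Describe the envelope of one Mel-band spectrum by its first DCT terms."""
    # Assumption: log compression of the band energies (eps avoids log(0)).
    log_mel = [math.log(e + eps) for e in mel_energies]
    return dct_ii(log_mel)[:n_coeffs]


# One frame with a smoothly decaying Mel-band spectrum (40 bands).
frame = [1.0 / (1 + band) for band in range(40)]
coeffs = mfcc_from_mel(frame)
print(len(coeffs), [round(c, 2) for c in coeffs[:3]])
```

Applying mfcc_from_mel to each successive frame and stacking the results column-wise yields a matrix of the kind shown in Figure 2.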




Fig. 2. Successive steps for the computation of MFCCs, illustrated with the analysis of an 
audio excerpt decomposed into frames. 



2.2 Example: Tonality analysis 

The spectrum is converted from the frequency domain to the pitch domain by applying 
a log-frequency transformation. The distribution of the energy along the pitches 
is called the chromagram. The chromagram is then wrapped, by fusing the pitches 
belonging to the same pitch classes. The wrapped chromagram therefore shows a 
distribution of the energy with respect to the twelve possible pitch classes. Krumhansl and 
Schmuckler (Krumhansl, 1990) proposed a method for estimating the tonality of a 
musical piece (or an extract thereof) by computing the cross-correlation of its pitch 
class distribution with the distribution associated with each possible tonality. These 
distributions have been established through listening experiments (Krumhansl and 
Kessler, 1982). The most prevalent tonality is considered to be the tonality candidate 
with the highest correlation (or key strength). This method was originally designed for 
the analysis of symbolic representations of music but has been extended to audio 
analysis through an adaptation of the pitch class distribution to the chromagram 
representation (Gómez, 2006). Figure 3 displays the successive steps of this approach. 

Fig. 3. Successive steps for the calculation of chromagram and estimation of key strengths, 
illustrated with the analysis of an audio excerpt, this time not decomposed into frames. 
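
The wrapping and the key strength computation can be sketched in Python as follows; this is an illustration, not MIRtoolbox code. The function names are hypothetical, pitches are assumed to be given as MIDI note numbers (so that pitch class 0 is C), the correlation is taken to be the Pearson coefficient, and the profile values are the major and minor key profiles as commonly quoted from Krumhansl and Kessler (1982).

```python
import math
from typing import Dict, List

# Key profiles for C major and C minor, indexed by pitch class (C = 0).
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def wrap_chromagram(energy_by_pitch: Dict[int, float]) -> List[float]:
    """Fuse pitches of the same pitch class (assumes MIDI pitch numbers)."""
    chroma = [0.0] * 12
    for pitch, energy in energy_by_pitch.items():
        chroma[pitch % 12] += energy
    return chroma


def pearson(x: List[float], y: List[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def key_strengths(chroma: List[float]) -> Dict[str, float]:
    """Correlate the wrapped chromagram with all 24 transposed profiles."""
    strengths = {}
    for tonic in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            # Transpose the profile so that its tonic falls on `tonic`.
            rotated = [profile[(pc - tonic) % 12] for pc in range(12)]
            strengths[f"{NAMES[tonic]} {mode}"] = pearson(chroma, rotated)
    return strengths


# Toy input: energy on the notes of an A major triad in two octaves.
chroma = wrap_chromagram({57: 1.0, 61: 0.6, 64: 0.8, 69: 0.9})
strengths = key_strengths(chroma)
print(max(strengths, key=strengths.get))
```

The key with the highest value in the returned dictionary plays the role of the most prevalent tonality described above.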



A richer representation of the tonality estimation can be drawn with the help 
of a self-organizing map (SOM), trained with the 24 tonal profiles (Toiviainen and 
Krumhansl, 2003). The configuration of the 24 classes on the SOM after training 
corresponds to studies in music theory. The estimation of the tonality of the 
musical piece under study is carried out by projecting its wrapped chromagram onto 
the SOM. 

Fig. 4. Activity pattern of a self-organizing map representing the tonal configuration of the 
first two seconds of Mozart Sonata in A major, K 331. High activity is represented by bright 
nuances. 




2.3 Example: Rhythm analysis 

One common way of estimating the rhythmic pulsation, described in figure 5, is 
based on auditory modelling (Tzanetakis and Cook, 1999). The audio signal is first 
decomposed into auditory channels using a bank of filters. The envelope of each 
channel is extracted. As pulsation is generally related to increases of energy only, 
the envelopes are differentiated and half-wave rectified before being finally summed 
together again. This gives a precise description of the variation of energy produced 
by each note event from the different auditory channels. 
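
The differentiation, half-wave rectification and summation can be sketched in Python as follows. The sketch assumes that the channel envelopes have already been extracted and sampled at a common rate; the function name onset_curve is hypothetical and not a MIRtoolbox function.

```python
from typing import List


def onset_curve(envelopes: List[List[float]]) -> List[float]:
    """Sum the positive parts of the differentiated channel envelopes."""
    n = len(envelopes[0])
    curve = [0.0] * n
    for env in envelopes:
        for t in range(1, n):
            diff = env[t] - env[t - 1]  # discrete differentiation
            if diff > 0:                # half-wave rectification
                curve[t] += diff        # summation across channels
    return curve


# Two toy channel envelopes of five samples each.
print(onset_curve([[0, 1, 3, 2, 2], [0, 0, 1, 0, 4]]))  # [0.0, 1, 3, 0.0, 4]
```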

After this onset detection, the periodicity is estimated through autocorrelation. 
However, if the tempo varies throughout the piece, an autocorrelation of the whole 
sequence will not show clear periodicities. In such cases it is better to compute the 
autocorrelation on a moving window. This yields a periodogram that highlights the 
different periodicities, as shown in figure 5. In order to focus on the periodicities that 
are more perceptible, the periodogram is filtered using a resonance curve (Toiviainen 
and Snyder, 2003), after which the best tempos are estimated through peak picking, 
and the results are converted into beats per minute. Due to the difficulty of choosing 
among the possible multiples of the tempo, several candidates (three for instance) 
may be selected for each frame, and a histogram of all the candidates for all the 
frames, called periodicity histogram, can be drawn. 
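
A much reduced version of this periodicity estimation can be sketched in Python as follows. It is an illustration under simplifying assumptions: a single autocorrelation over the whole onset curve is used, the resonance curve filtering and the moving window are omitted, only the strongest lag within a plausible tempo range is kept, and the names autocorrelation and estimate_tempo as well as the default range of 40 to 200 beats per minute are choices made for this sketch.

```python
import math
from typing import List


def autocorrelation(x: List[float], max_lag: int) -> List[float]:
    """Unnormalised autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def estimate_tempo(onsets: List[float], frame_rate: float,
                   bpm_min: float = 40.0, bpm_max: float = 200.0) -> float:
    """Return the tempo (beats per minute) of the strongest periodicity."""
    # A period of `lag` samples corresponds to 60 * frame_rate / lag BPM.
    lag_min = max(1, int(math.ceil(60.0 * frame_rate / bpm_max)))
    lag_max = min(len(onsets) - 1, int(60.0 * frame_rate / bpm_min))
    ac = autocorrelation(onsets, lag_max)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda lag: ac[lag])
    return 60.0 * frame_rate / best_lag


# Onset curve sampled at 100 Hz with one onset every 0.5 s.
onsets = [1.0 if t % 50 == 0 else 0.0 for t in range(1000)]
print(estimate_tempo(onsets, frame_rate=100.0))  # 120.0
```

Because the unnormalised autocorrelation favours short lags, the sketch returns the fastest of the competing multiples; keeping several candidates per frame, as described above, avoids committing to one of them.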




Fig. 5. Successive steps for the estimation of tempo illustrated with the analysis of an audio 
excerpt. In the periodogram, high autocorrelation values are represented by bright nuances. 



3 Data analysis 

The toolbox includes diverse tools for data analysis, such as a peak extractor, and 
functions that compute histograms, entropy, zero-crossing rates, irregularity or var- 
ious statistical descriptors (centroid, spread, skewness, kurtosis, flatness) on data of 
various types, such as spectrum, envelope or histogram. The peak picker can accept 
any data returned by any other function of the MIRtoolbox and can adapt to the dif- 
ferent kinds of data of any number of dimensions. In the graphical representation 
of the results, the peaks are automatically located on the corresponding curves (for 
1D data) or bit-map images (for 2D data). We have designed a new strategy of peak 
selection, based on a notion of contrast, discarding peaks that are not sufficiently 
contrastive (based on a certain threshold) with the neighbouring peaks. This adaptive 
filtering strategy hence adapts to the local particularities of the curves. Its articula- 
tion with other more conventional thresholding strategies leads to an efficient peak 
picking module that can be applied throughout the MIRtoolbox. 
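
The exact contrast criterion is not spelled out above. As a rough stand-in, the following Python sketch keeps a local maximum of a 1D curve only if it rises above the valleys separating it from higher neighbouring peaks by at least a given fraction of the total amplitude range (its topographic prominence). The function name pick_peaks, the parameter contrast and this particular definition are assumptions of the sketch and should not be read as the MIRtoolbox algorithm.

```python
from typing import List


def pick_peaks(y: List[float], contrast: float = 0.1) -> List[int]:
    """Indices of local maxima whose prominence is at least contrast * range."""
    span = max(y) - min(y)
    if span == 0:
        return []
    peaks = []
    for i in range(1, len(y) - 1):
        if not (y[i - 1] < y[i] >= y[i + 1]):
            continue  # not a local maximum
        # Lowest valley on each side before a higher sample (or the border).
        left_min, j = y[i], i - 1
        while j >= 0 and y[j] <= y[i]:
            left_min = min(left_min, y[j])
            j -= 1
        right_min, j = y[i], i + 1
        while j < len(y) and y[j] <= y[i]:
            right_min = min(right_min, y[j])
            j += 1
        prominence = y[i] - max(left_min, right_min)
        if prominence >= contrast * span:
            peaks.append(i)
    return peaks


curve = [0, 5, 4.8, 5.1, 0, 2, 0]
print(pick_peaks(curve, contrast=0.1))  # [3, 5]
print(pick_peaks(curve, contrast=0.5))  # [3]
```

In the example, the maximum at index 1 is discarded because it is separated from its higher neighbour at index 3 by a shallow dip only, which is the kind of locally adaptive filtering the text describes.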

More elaborate tools have also been implemented that can carry out higher-level 
analyses and transformations. In particular, audio files can be automatically seg- 
mented into a series of homogeneous sections, through the estimation of temporal 
discontinuities in timbral features (Foote and Cooper, 2003). The resulting segments 
can then be clustered into classes, suggesting a formal analysis of the musical piece. 
Supervised classification of musical samples can also be performed, using techniques 
such as K-Nearest Neighbours or Gaussian Mixture Models. The results of feature 
extraction processes can be stored as text files of various formats, such as the ARFF 
format that can be imported into the Weka machine learning environment (Witten and 
Frank, 2005). 



4 Application to the study of music and emotion 

The toolbox was conceived in the context of a project investigating the 
interrelationships between perceived emotions and acoustic features. In a first study, musical 
features of musical materials collected from a large number of recent empirical studies 
in music and emotions (15 so far) are systematically reanalysed. For emotion 
ratings based on interval scales, using emotion dimensions such as emotional valence 
(liking, preference) and activity, the mapping applies linear models, where 
ridge regression can handle the highly collinear variables. If on the contrary the 
emotion ratings contain categorical data (happy, sad, angry, scary or tender), supervised 
classification and linear methods such as discriminant analysis or logistic regression 
are used. The early results suggest that a substantial part (50-70%) of the variance 
in the emotion ratings can be explained by a small set of acoustic features, although 
the exact set of features is dependent on the musical genre. The existence of several 
different data sets representing different genres and data types makes the selection 
of appropriate statistical measures challenging. 

A second study focuses on musical timbre. Listeners’ ratings of 110 short 
instrument sounds (of constant pitch height, loudness and duration, but varying timbre) on 
four bipolar scales (valence, energy arousal, tension arousal, and preference) were 
correlated with the acoustic features. We found that positively valenced sounds tend 
to be dark in sound colour (see figure 6). Energy arousal, on the other hand, is more 
related to high spectral flux and high brightness (R² = .62). The tension arousal 
ratings (R² = .70) are a mixture of high brightness, roughness and inharmonicity. 
These observations extend the earlier findings relating to timbre and emotions 
(Scherer and Oshinsky, 1977; Juslin, 1997). 

The emotional connotations induced by instrument sounds alone are consistent 
across listeners and can be meaningfully connected to acoustic descriptors of tim- 
bre. Certain aspects of these features are already known from the expressive speech 
literature (e.g., brightness or high-frequency energy, Juslin and Laukka, 2003), but 
musical sounds have distinctive features such as inharmonicity and roughness which 
are reflected in emotion ratings. According to our view, subtle nuances in music- 
induced emotions can only be studied from the audio representation using tools that 
extract acoustic features in a fashion that is relevant for the perceptual processing of 
sounds. Further work is necessary for establishing the effectiveness and reliability of 
higher-level features - such as rhythmic patterns, tonality and harmony - in terms of 
their correspondence with the listener observations. 

Furthermore, both the MIDItoolbox and the multivariate techniques that were 
used to extract meaningful results out of the audio descriptors in the examples are 
well suited to other, similar tasks in content analysis of musical audio. For example, 
genre, instrument and artist recognition are such application areas where the ex- 
tracted features and multivariate statistics are used (see Tzanetakis and Cook, 2002). 
Applications such as these are especially important commercially, as ever-increasing 
volumes of recordings need to be automatically indexed. 

Fig. 6. Correlation between listener rating of valence and acoustic brightness of short instrument sounds.

Following our first Matlab toolbox, called MIDItoolbox (Eerola and Toiviainen,
2004), dedicated to the analysis of symbolic representations of music, the MIRtoolbox
will be available for free, from September 2007 on, at the following address:
http://users.jyu.fi/~lartillo/mirtoolbox

This work has been supported by the European Commission (NEST project
“Tuning the Brain for Music”, code 028570).



Abstract. Artificial systems with a high degree of autonomy require reliable semantic
information about the context they operate in. State interpretation, however, is a difficult task.
Interpretations may depend on a history of states and there may be more than one valid in- 
terpretation. We propose a model for spatio-temporal situations using hidden Markov models 
based on relational state descriptions, which are extracted from the estimated state of an un- 
derlying dynamic system. Our model covers concurrent situations, scenarios with multiple 
agents, and situations of varying durations. To evaluate the practical usefulness of our model, 
we apply it to the concrete task of online traffic analysis. 



1 Introduction 

It is a fundamental ability for an autonomous agent to continuously monitor and un- 
derstand its internal states as well as the state of the environment. This ability allows 
the agent to make informed decisions in the future, to avoid risks, and to resolve 
ambiguities. Consider, for example, a driver assistance application that notifies the 
driver when a dangerous situation is developing, or a surveillance system at an airport 
that recognizes suspicious behaviors. Such applications do not only have to be aware 
of the current state, but also have to be able to interpret it in order to act rationally. 

State interpretation, however, is not an easy task as one has to also consider the 
spatio-temporal context, in which the current state is embedded. Intuitively, the agent 
has to understand the situation that is developing. The goals of this work are to for- 
mally define the concept of situation and to develop a sound probabilistic framework 
for modeling and recognizing situations. 

Related work includes Anderson et al. (2004), who propose relational Markov
models with fully observable states. Fern and Givan (2004) describe an inference
technique for sequences of hidden relational states, which must be inferred
from observations. Their approach is based on logical constraints, and uncertainties
are not handled probabilistically. Kersting et al. (2006) propose logical
hidden Markov models where the probabilistic framework of hidden Markov models
is integrated with a logical representation of the states. In contrast, the states of our
proposed situation models are represented by conjunctions of logical atoms instead of
single atoms, and we present a filtering technique based on a relational, non-parametric
probabilistic representation of the observations.



2 Framework for modeling and recognizing situations 

Dynamic and uncertain systems can in general be described using dynamic Bayesian
networks (DBNs) (Dean and Kanazawa (1989)). DBNs consist of a set of random
variables that describe the system at each point in time $t$. The state of the system at
time $t$ is denoted by $x_t$ and $z_t$ represents the observations. Furthermore, DBNs contain
the conditional probability distributions that describe how the random variables are
related.




Fig. 1. Overview of the framework. At each time step $t$, the state $x_t$ of the system is estimated
from the observations $z_t$. A relational description $o_t$ of the estimated state is generated and
evaluated against the different situation models $\lambda_1, \ldots, \lambda_n$.



Intuitively, a situation is an interpretation associated to some states of the sys- 
tem. In principle, situations could be represented in such a DBN model by introduc- 
ing additional latent situation variables and by defining their influence on the rest 
of the system. Since this would lead to an explosion of network complexity already 
for moderately sized models, we introduce a relational abstraction layer between the 
system DBN used for estimating the state of the system, and the situation models 
used to recognize the situations associated to the state of the system. In this framework,
we sequentially estimate the system state $x_t$ from the observations $z_t$ in the
DBN model using the Bayes filtering scheme. In a second step within each time
step, we transform the state estimate $x_t$ to a relational state description $o_t$, which is
then used to recognize instances of the different situation models. Figure 1 visualizes
the structure of our proposed framework for situation recognition.



3 Modeling situations

Based on the DBN model of the system outlined in the previous section, a situation 
can be described as a sequence of states with a meaningful interpretation. Since in 
general we are dealing with continuous state variables, it would be impractical or 
even impossible to reason about states, and state sequences directly in that space. 
Instead, we use an abstract representation of the states, and define situations as se- 
quences of these abstract states. 

3.1 Relational state representation 

For the abstract representation of the state of the system, relational logic will be used. 
In relational logic, an atom $r(t_1, \ldots, t_n)$ is an $n$-tuple of terms $t_i$ with a relation symbol
$r$. A term can be either a variable $R$ or a constant $c$. Relations can be defined over
the state variables or over features that can be directly extracted from them. Table 1 
illustrates possible relations defined over the distance and bearing state variables in 
a traffic scenario. 



Table 1. Example distance and bearing relations for a traffic scenario. 



Relation          Distances        Relation             Bearing angles

equal(R,R')       [0 m, 1 m)       in_front_of(R,R')    [315°, 45°)
close(R,R')       [1 m, 5 m)       right(R,R')          [45°, 135°)
medium(R,R')      [5 m, 15 m)      behind(R,R')         [135°, 225°)
far(R,R')         [15 m, ∞)        left(R,R')           [225°, 315°)



An abstract state is a conjunction of logical atoms (see also Cocora et al. (2006)). 
Consider for example the abstract state q = far(R,R'),behind(R,R'), which repre- 
sents all states in which a car is far and behind another car. 
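
As an illustration of the relational abstraction, the following minimal sketch maps a continuous distance and bearing to the relations of Table 1 and builds the corresponding abstract state. The function names, the string encoding of atoms and the open upper bound of the far relation are assumptions made for this example, not part of the original framework.

```python
import math

# Interval bounds follow Table 1. The open upper bound of "far" is an assumption.
DISTANCE_RELATIONS = [
    ("equal", 0.0, 1.0),
    ("close", 1.0, 5.0),
    ("medium", 5.0, 15.0),
    ("far", 15.0, math.inf),
]


def distance_relation(distance_m):
    """Return the name of the distance relation whose interval contains distance_m."""
    for name, lo, hi in DISTANCE_RELATIONS:
        if lo <= distance_m < hi:
            return name
    raise ValueError("distance must be a non-negative number")


def bearing_relation(bearing_deg):
    """Return the bearing relation; the in_front_of interval wraps around 0 degrees."""
    b = bearing_deg % 360.0
    if b >= 315.0 or b < 45.0:
        return "in_front_of"
    if b < 135.0:
        return "right"
    if b < 225.0:
        return "behind"
    return "left"


def abstract_state(distance_m, bearing_deg, a="R", b="R'"):
    """Build a conjunction of atoms, encoded here as a list of strings."""
    return [
        f"{distance_relation(distance_m)}({a},{b})",
        f"{bearing_relation(bearing_deg)}({a},{b})",
    ]
```

For a car 20 m away at a bearing of 180°, `abstract_state(20.0, 180.0)` yields the two atoms of the example state q given above.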

3.2 Situation models 

Hidden Markov models (HMMs) (Rabiner (1989)) are used to describe the admis- 
sible sequences of states that correspond to a given situation. HMMs are temporal 
probabilistic models for analyzing and modeling sequential data. In our framework 
we use HMMs whose states correspond to conjunctions of relational atoms, that is, 
abstract states as described in the previous section. The state transition probabilities 
of the HMM specify the allowed transitions between these abstract states. In this 
way, HMMs specify a probability distribution over sequences of abstract states. 

To illustrate how HMMs and abstract states can be used to describe situations,
consider a passing maneuver like the one depicted in Figure 2. Here, a reference
car is passed by a faster car on the left hand side. The maneuver could be coarsely
described in three steps: (1) the passing car is behind the reference car, (2) it is left of
it, and (3) it is in front. Using, for example, the bearing relations presented in Table 1,
an HMM that describes this sequence could have three states, one for each step of
the maneuver: $q_0$ = behind(R,R'), $q_1$ = left(R,R'), and $q_2$ = in_front_of(R,R').
The transition model of this HMM is depicted in Figure 2. It defines the allowed
transitions between the states. Observe how the HMM specifies that when in the
second state ($q_1$), that is, when the passing car is left of the reference car, it can
only remain left ($q_1$) or move in front of the reference car ($q_2$). It is not allowed to
move behind it again ($q_0$). Such a sequence would not be a valid passing situation
according to our description.

Fig. 2. Passing maneuver and corresponding HMM.
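
The transition structure of the passing HMM can be sketched as a small matrix. The numerical probabilities below are invented for illustration. Only the zero pattern of the row for the second state follows the text, and the rows for the first and third state are assumptions.

```python
# States q_0, q_1, q_2 of the passing HMM, named after their bearing relation.
STATES = ["behind", "left", "in_front_of"]

# Illustrative transition matrix (assumed values). Row i holds the probabilities
# of moving from state i to each state j. The row for "left" allows only
# staying left or moving in front, as described in the text.
A = [
    [0.7, 0.3, 0.0],  # behind -> behind or left (assumption)
    [0.0, 0.6, 0.4],  # left -> left or in_front_of
    [0.0, 0.0, 1.0],  # in_front_of -> in_front_of (assumption)
]

# Illustrative initial state distribution (assumed): the maneuver starts behind.
PI = [1.0, 0.0, 0.0]


def sequence_probability(sequence):
    """Probability of a non-empty sequence of abstract states under (A, PI)."""
    idx = [STATES.index(s) for s in sequence]
    p = PI[idx[0]]
    for i, j in zip(idx, idx[1:]):
        p *= A[i][j]
    return p
```

Under these assumed values, the sequence behind, left, in_front_of receives a positive probability, whereas behind, left, behind receives probability zero and is therefore not a valid passing situation.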

A situation HMM consists of a tuple $\lambda = (Q, A, \pi)$, where $Q = \{q_0, \ldots, q_N\}$
represents a finite set of $N$ states, which are in turn abstract states as described in the
previous section, $A = \{a_{ij}\}$ is the state transition matrix where each entry $a_{ij}$
represents the probability of a transition from state $q_i$ to state $q_j$, and $\pi = \{\pi_i\}$ is the initial
state distribution, where $\pi_i$ represents the probability of state $q_i$ being the initial state.
Additionally, just as for the DBNs, there is also an observation model. In our case,
this observation model is the same for every situation HMM, and will be described
in detail in Section 4.1.



4 Recognizing situations 

The idea behind our approach to situation recognition is to instantiate at each time 
step new candidate situation HMMs and to track these over time. A situation HMM 
can be instantiated if it assigns a positive probability to the current state of the sys- 
tem. Thus, at each time step t, the algorithm keeps track of a set of active situation 
hypotheses, based on a sequence of relational descriptions. 

The general algorithm for situation recognition and tracking is as follows. At 
every time step t, 

1. Estimate the current state of the system $x_t$ (see Section 2).

2. Generate relational representation $o_t$ from $x_t$. From the estimated state of the
system $x_t$, a conjunction $o_t$ of grounded relational atoms with an associated probability
is generated (see next section).

3. Update all instantiated situation HMMs according to $o_t$. Bayes filtering is used
to update the internal state of the instantiated situation HMMs.

4. Instantiate all non-redundant situation HMMs consistent with $o_t$. Based on $o_t$, all
situation HMMs are grounded, that is, the variables in the abstract states of the
HMM are replaced by the constant terms present in $o_t$. If a grounded HMM assigns
a non-zero probability to the current relational description $o_t$, the situation
HMM can be instantiated. However, we must first check that no other situation
of the same type and with the same grounding has an overlapping internal state.
If this is the case, we keep the oldest instance since it provides a more accurate
explanation for the observed sequence.
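
Step 3 of the algorithm can be sketched as one step of the standard HMM forward (Bayes filter) recursion. The function name and the list-based representation are assumptions for this example. The observation likelihoods are passed in as plain numbers, since the observation model is only introduced in the next section.

```python
def filter_step(belief, transition, obs_likelihood):
    """One Bayes filter update of the internal state of a situation HMM.

    belief[i]          -- P(q_i | o_1, ..., o_{t-1})
    transition[i][j]   -- a_ij, probability of moving from q_i to q_j
    obs_likelihood[j]  -- P(o_t | q_j)

    Returns the updated belief and the evidence P(o_t | o_1, ..., o_{t-1}).
    If the evidence is zero, the HMM cannot explain o_t and None is returned
    in place of the belief.
    """
    n = len(belief)
    # Prediction: propagate the belief through the transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correction: weight each state by the likelihood of the observation.
    unnormalized = [obs_likelihood[j] * predicted[j] for j in range(n)]
    evidence = sum(unnormalized)
    if evidence == 0.0:
        return None, 0.0
    return [u / evidence for u in unnormalized], evidence
```

Multiplying the evidence terms returned at successive time steps gives the likelihood of the observation sequence under the model, which is the quantity that an instance tracker would monitor.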

4.1 Representing uncertainty at the relational level 

At each time step $t$, our algorithm estimates the state $x_t$ of the system. The estimated
state is usually represented through a probability distribution which assigns a probability
to each possible hypothesis about the true state. In order to be able to use the
situation HMMs to recognize situation instances, we need to represent the estimated 
state of the system as a grounded abstract state using relational logic. 

To convert the uncertainties related to the estimated state $x_t$ into appropriate uncertainties
at the relational level, we assign to each relation the probability mass
associated to the interval of the state space that it represents. The resulting distribution
is thus a histogram that assigns to each relation a single cumulative probability.
Such a histogram can be thought of as a piecewise constant approximation of the
continuous density. The relational description $o_t$ of the estimated state of the system
$x_t$ at time $t$ is then a grounded abstract state where each relation has an associated
probability.

The probability $P(o_t \mid q_i)$ of observing $o_t$ while being in a grounded abstract state
$q_i$ is computed as the product of the matching terms in $o_t$ and $q_i$. In this way, the
observation probabilities needed to estimate the internal state of the situation HMMs
and the likelihood of a given sequence of observations $O_{1:t} = (o_1, \ldots, o_t)$ can be
computed.
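
The following sketch shows one possible reading of this section. It assumes, purely for illustration, that the distance estimate is Gaussian, assigns to each distance relation the probability mass of its interval, and multiplies the probabilities of the relations that match the atoms of an abstract state. All names and the Gaussian assumption are hypothetical and not taken from the original implementation.

```python
import math

# Distance intervals as in Table 1 (the open upper bound of "far" is assumed).
DISTANCE_BINS = [
    ("equal", 0.0, 1.0),
    ("close", 1.0, 5.0),
    ("medium", 5.0, 15.0),
    ("far", 15.0, math.inf),
]


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian with the given mean and std."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def relation_histogram(mean, std):
    """Probability mass of each distance relation under a Gaussian estimate.

    Mass below 0 m is ignored in this sketch, so the values may sum to
    slightly less than one.
    """
    hist = {}
    for name, lo, hi in DISTANCE_BINS:
        upper = 1.0 if math.isinf(hi) else normal_cdf(hi, mean, std)
        hist[name] = upper - normal_cdf(lo, mean, std)
    return hist


def observation_probability(observation, abstract_state):
    """Product of the probabilities in `observation` that match the atoms of the state.

    `observation` maps relation names to probabilities, `abstract_state` is a
    list of relation names. A relation missing from the observation counts as
    probability zero (an assumption of this sketch).
    """
    p = 1.0
    for relation in abstract_state:
        p *= observation.get(relation, 0.0)
    return p
```

With a distance estimate of mean 20 m and standard deviation 1 m, nearly all probability mass falls on the far relation, so abstract states that contain far(R,R') receive a high observation probability.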

4.2 Situation model selection using Bayes factors 

The algorithm for recognizing situations keeps track of a set of active situation hypotheses
at each time step $t$. We propose to decide between models at a given time $t$
using Bayes factors for comparing two competing situation HMMs that explain the 
given observation sequence. Bayes factors (Kass and Raftery (1995)) provide a way 
of evaluating evidence in favor of a probabilistic model as opposed to another one. 
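The Bayes factor $B_{12}$ for two competing models $\lambda_1$ and $\lambda_2$ is computed as

$$B_{12} = \frac{P(\lambda_1 \mid O_{t_1:t_1+n_1})}{P(\lambda_2 \mid O_{t_2:t_2+n_2})} = \frac{P(O_{t_1:t_1+n_1} \mid \lambda_1)\,P(\lambda_1)}{P(O_{t_2:t_2+n_2} \mid \lambda_2)\,P(\lambda_2)}, \qquad (1)$$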

that is, the ratio between the likelihood of the models being compared given the data.
The Bayes factor can be interpreted as evidence provided by the data in favor of a
model as opposed to another one (Jeffreys (1961)).

In order to use the Bayes factor as evaluation criterion, the observation sequence
$O_{t:t+n}$ which the models in Equation 1 are conditioned on, must be the same for the
two models being compared. This is, however, not always the case, since situations
can be instantiated at any point in time. To solve this problem we propose a solution
used for sequence alignment in bio-informatics (Durbin et al. (1998)) and extend the
situation model using a separate world model to account for the missing part of the
observation sequence. This world model in our case is defined analogously to the
bigram models that are learned from corpora in the field of natural language processing
(Manning and Schütze (1999)). By using the extended situation model, we
can use Bayes factors to evaluate two situation models even if they were instantiated
at different points in time.
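
The comparison of two situation models on a common observation sequence can be sketched as follows. Working with logarithms is an assumption made here for numerical convenience, as are the function name and the equal default priors. The sequence likelihoods are taken as given inputs.

```python
import math


def log_bayes_factor(loglik_1, loglik_2, prior_1=0.5, prior_2=0.5):
    """Logarithm of the Bayes factor of model 1 against model 2.

    loglik_1, loglik_2 -- log-likelihoods of the same observation sequence
                          under the two (extended) situation models
    prior_1, prior_2   -- prior probabilities of the models (assumed equal
                          by default)

    A positive value is evidence in favor of model 1, a negative value is
    evidence in favor of model 2.
    """
    return (loglik_1 + math.log(prior_1)) - (loglik_2 + math.log(prior_2))
```

For example, if the sequence is twice as likely under the first model and the priors are equal, the function returns log 2, which is positive and therefore favors the first model.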



5 Evaluation 

Our framework was implemented and tested in a traffic scenario using a simulated 
3D environment. TORCS - The Open Racing Car Simulator (Espie and Guionneau) 
was used as simulation environment. The scenario consisted of several autonomous 
vehicles with simple driving behaviors and one reference vehicle controlled by a 
human operator. Random noise was added to the pose of the vehicles to simulate un- 
certainty at the state estimation level. The goal of the experiments is to demonstrate 
that our framework can be used to model and successfully recognize different sit- 
uations in dynamic multi-agent environments. Concretely, three different situations 
relative to a reference car were considered:

1. The passing situation corresponds to the reference car being passed by another
car. The passing car approaches the reference car from behind, it passes it on the 
left, and finally ends up in front of it. 

2. The aborted passing situation is similar to the passing situation, but the reference 
car is never fully overtaken. The passing car approaches the reference car from 
behind, it slows down before being abeam, and ends up behind it again. 

3. The follow situation corresponds to the reference car being followed from behind 
by another car at a short distance and at the same velocity. 

The structure and parameters of the corresponding situation HMMs were defined
manually. The relations considered for these experiments were defined over the
relative distance, position, and velocity of the cars.

Figure 3 (left) plots the likelihood of an observation sequence corresponding to
a passing maneuver. During this maneuver, the passing car approaches the reference
car from behind. Once at close distance, it maintains the distance for a couple of
seconds. It then accelerates and passes the reference car on the left to finally end up
in front of it. It can be observed in the figure how the algorithm correctly instantiated
the different situation HMMs and tracked the different instances during the
execution of the maneuver. For example, the passing and aborted passing situations
were instantiated simultaneously from the start, since both situation HMMs initially
describe the same sequence of observations. The follow situation HMM was instantiated,
as expected, at the point where both cars were close enough and their relative
velocity was almost zero. Observe too that at this point, the likelihood according to
the passing and aborted passing situation HMMs starts to decrease rapidly, since
these two models do not expect both cars to drive at the same speed. As the passing
vehicle starts changing to the left lane, the HMM for the follow situation stops providing
an explanation for the observation sequence and, accordingly, the likelihood
starts to decrease rapidly until it becomes almost zero. At this point the instance of
the situation is not tracked anymore and is removed from the active situation set. This
happens since the follow situation HMM does not expect the vehicle to speed up and
change lanes.

Fig. 3. (Left) Likelihood of the observation sequence for a passing maneuver according to
the different situation models, and (right) Bayes factor in favor of the passing situation model
against the other situation models.

The Bayes factor in favor of the passing situation model compared against the 
follow situation model is depicted in Figure 3 (right). A positive Bayes factor value 
indicates that there is evidence in favor of the passing situation model. Observe that 
up to the point where the follow situation is actually instantiated the Bayes factor 
keeps increasing rapidly. At the time where both cars are equally fast, the evidence 
in favor of the passing situation model starts decreasing until it becomes negative. At 
this point there is evidence against the passing situation model, that is, there is evi- 
dence in favor of the follow situation. Finally, as the passing vehicle starts changing 
to the left lane the evidence in favor of the passing situation model starts increas- 
ing again. Figure 3 (right) shows how Bayes factors can be used to make decisions 
between competing situation models. 



6 Conclusions and further work 

We presented a general framework for modeling and recognizing situations. Our ap- 
proach uses a relational description of the state space and hidden Markov models to 
represent situations. An algorithm was presented to recognize and track situations 
in an online fashion. The Bayes factor was proposed as evaluation criterion between
two competing models. Using our framework, many meaningful situations can be
modeled. Experiments demonstrate that our framework is capable of tracking multiple
situation hypotheses in a dynamic multi-agent environment.



Abstract. Reliable automatic methods are needed for statistical online monitoring of noisy
time series. Application of a robust scale estimator allows the use of adaptive thresholds for the
detection of outliers and level shifts. We propose a fast update algorithm for the $Q_n$ estimator
and show by simulations that it leads to more powerful tests than other highly robust scale
estimators.



1 Introduction 

Reliable online analysis of high frequency time series is an important requirement 
for real-time decision support. For example, automatic alarm systems currently used 
in intensive care produce a high rate of false alarms due to measurement artifacts, 
patient movements, or transient fluctuations around the chosen alarm limit. Prepro- 
cessing the data by extracting the underlying level (the signal) and variability of the 
monitored physiological time series, such as heart rate or blood pressure can improve 
the false alarm rate. Additionally, it is necessary to detect relevant changes in the ex- 
tracted signal since they might point at serious changes in the patient’s condition. 

The high number of artifacts observed in many time series requires the applica- 
tion of robust methods which are able to withstand some largely deviating values. 
However, many robust methods are computationally too demanding for real time 
application if efficient algorithms are not available. 

Gather and Fried (2003) recommend Rousseeuw and Croux’s (1993) $Q_n$ estimator
to measure the variability of the noise in robust signal extraction. The $Q_n$ possesses
a breakdown point of 50%, i.e. it can resist up to almost 50% large outliers
without becoming extremely biased. Additionally, its Gaussian efficiency is 82% in
large samples, which is much higher than that of other robust scale estimators: for example,
the asymptotic efficiency of the median absolute deviation about the median
(MAD) is only 36%. However, in an online application to moving time windows the
MAD can be updated in $O(\log n)$ time (Bernholt et al. (2006)), while the fastest algorithm
known so far for the $Q_n$ needs $O(n \log n)$ time (Croux and Rousseeuw (1992)),
where $n$ is the width of the time window.

In this paper, we construct an update algorithm for the $Q_n$ estimator which, in
practice, is substantially faster than the offline algorithm and implies an advantage
for online application. The algorithm is easy to implement and can also be used
to compute the Hodges-Lehmann location estimator (HL) online. Additionally, we
show by simulation that the $Q_n$ leads to resistant rules for shift detection which have
higher power than rules using other highly robust scale estimators. This better power
can be explained by the well-known high efficiency of the $Q_n$ for estimation of the
variability.

Section 2 presents the update algorithm for the $Q_n$. Section 3 describes a comparative
study of rules for level shift detection which apply a robust scale estimator
for fixing the thresholds. Section 4 draws some conclusions.



2 An update algorithm for the Qn and the HL estimator 

For data $x_1, \ldots, x_n$, $x_i \in \mathbb{R}$ and $k = \binom{\lfloor n/2 \rfloor + 1}{2}$, $\lfloor a \rfloor$ denoting the largest integer not
larger than $a$, the $Q_n$ scale estimator is defined as

$$Q_n = c_n \, \{ |x_i - x_j|,\ 1 \le i < j \le n \}_{(k)},$$

corresponding to approximately the first quartile of all pairwise differences. Here,
$c_n$ denotes a finite sample correction factor for achieving unbiasedness for the
estimation of the standard deviation $\sigma$ at Gaussian samples of size $n$. For online
analysis of a time series $x_1, \ldots, x_N$, we can apply the $Q_n$ to a moving time window
$x_{t-n+1}, \ldots, x_t$ of width $n < N$, always adding the incoming observation $x_{t+1}$ and
deleting the oldest observation $x_{t-n+1}$ when moving the time window from $t$ to $t + 1$.
Addition of $x_{t+1}$ and deletion of $x_{t-n+1}$ is called an update in the following.
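
For reference, the definition can be evaluated directly by sorting all pairwise differences. The sketch below is a naive baseline with quadratic memory and is not the update algorithm of this paper. The function names are hypothetical, and the correction factor is left as a parameter `c` with a placeholder default of 1.0, since its values are not given here.

```python
from itertools import combinations


def qn_naive(window, c=1.0):
    """Q_n of a window by brute force: c times the k-th smallest |x_i - x_j|.

    k is the binomial coefficient of (floor(n/2) + 1) over 2. The correction
    factor c is a placeholder (assumed 1.0 here).
    """
    n = len(window)
    if n < 2:
        raise ValueError("Q_n needs at least two observations")
    h = n // 2 + 1
    k = h * (h - 1) // 2
    diffs = sorted(abs(a - b) for a, b in combinations(window, 2))
    return c * diffs[k - 1]


def moving_qn(series, n, c=1.0):
    """Q_n on every moving window x_{t-n+1}, ..., x_t of width n."""
    return [qn_naive(series[t - n + 1 : t + 1], c) for t in range(n - 1, len(series))]
```

Recomputing every window from scratch in this way is exactly the cost that an update algorithm tries to avoid, but it is useful as a correctness check for faster implementations.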

It is possible to compute the $Q_n$ as well as the HL estimator of $n$ observations with
an algorithm by Johnson and Mizoguchi (1978) in running time $O(n \log n)$, which has
been proved to be optimal for offline calculation. An optimal online update algorithm
therefore needs at least $O(\log n)$ time for insertion or deletion, respectively, since
otherwise we could construct an algorithm faster than $O(n \log n)$ for calculating the
$Q_n$ from scratch. The $O(\log n)$ time bound was achieved for $k = 1$ by Bespamyatnikh
(1998). For larger $k$ - as needed for the computation of $Q_n$ or the HL estimator - the
problem gets more difficult and to our knowledge there is no online algorithm yet.
Following an idea of Smid (1991), we use a buffer of possible solutions to get an
online algorithm for general $k$, because it is easy to implement and achieves a good
running time in practice. Theoretically, the worst case amortized time per update may
not be better than the offline algorithm, because $k = O(n^2)$ in our case. However, we
can show that our algorithm runs substantially faster for many data sets.

Lemma 1. It is possible to compute the $Q_n$ and the HL estimator by computing the
$k$th order statistic in a multiset of form $X + Y = \{x_i + y_j \mid x_i \in X \text{ and } y_j \in Y\}$.

Proof. For $X = \{x_1, \ldots, x_n\}$, $k' = \binom{\lfloor n/2 \rfloor + 1}{2}$ and $k = k' + n + \binom{n}{2}$ we may compute
the $Q_n$, with finite sample correction factor $c_n$, in the following way:

$$c_n \, \{ |x_i - x_j|,\ 1 \le i < j \le n \}_{(k')} = c_n \, \{ x_i - x_j,\ 1 \le i, j \le n \}_{(k)}.$$

Therefore we may compute the $Q_n$ by computing the $k$th order statistic in $X + (-X)$.

To compute the HL estimator $\hat\mu = \operatorname{median}\{(x_i + x_j)/2,\ 1 \le i \le j \le n\}$, we only
need to compute the median element in $X/2 + X/2$ following the convention that in
multisets of form $X + X$ exactly one of $x_i + x_j$ and $x_j + x_i$ appears for each $i$ and $j$. □

To compute the $k$th order statistic in a multiset of form $X + Y$, we use the algorithm
of Johnson and Mizoguchi (1978). Due to Lemma 1, we only consider the
online version of this algorithm in the following.
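
The rank shift used in the proof of Lemma 1 can be checked numerically. The multiset X + (-X) contains the n(n-1)/2 absolute differences once with a negative sign, n zeros from the pairs with i = j, and the absolute differences once more with a positive sign. The brute-force sketch below, with hypothetical function names, compares both sides of the identity without the correction factor.

```python
from itertools import combinations
from math import comb


def kth_abs_difference(xs):
    """k'-th smallest absolute pairwise difference, k' = C(floor(n/2) + 1, 2)."""
    n = len(xs)
    k_prime = comb(n // 2 + 1, 2)
    diffs = sorted(abs(a - b) for a, b in combinations(xs, 2))
    return diffs[k_prime - 1]


def kth_in_x_plus_minus_x(xs):
    """k-th order statistic in X + (-X), with k = k' + n + C(n, 2)."""
    n = len(xs)
    k_prime = comb(n // 2 + 1, 2)
    k = k_prime + n + comb(n, 2)
    sums = sorted(a + (-b) for a in xs for b in xs)
    return sums[k - 1]
```

Both functions sort O(n^2) values and serve only as a check of the identity. The point of the lemma is that the right-hand side has the X + Y form for which faster selection algorithms exist.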

2.1 Online algorithm 

To understand the online algorithm it is helpful to look at some properties of the
offline algorithm. It is convenient to visualize the algorithm working on a partially
sorted matrix $B = (b_{ij})$ with $b_{ij} = x_{(i)} + y_{(j)}$, although $B$ is, of course, never constructed.
The algorithm utilizes that $x_{(i)} + y_{(j)} \le x_{(i)} + y_{(\ell)}$ and $x_{(j)} + y_{(i)} \le x_{(\ell)} + y_{(i)}$
for $j \le \ell$. In consecutive steps, a matrix element is selected, regions in the matrix are
determined to be certainly smaller or certainly greater than this element, and parts of
the matrix are excluded from further consideration according to a case differentiation.
As soon as fewer than $n$ elements remain for consideration, they are sorted and the
sought-after element is returned. The algorithm may easily be extended to compute
a buffer $\mathcal{B}$ of size $s$ of matrix elements $b_{(k-\lfloor (s-1)/2 \rfloor):n^2}, \ldots$

To achieve a better computation time in online application, we use balanced trees,
more precisely indexed AVL-trees, as the main data structure. Inserting, deleting,
finding and determining the rank of an element needs $O(\log n)$ time in this data
structure. We additionally use two pointers for each element in a balanced tree. In
detail, we store $X$, $Y$, and $\mathcal{B}$ in separate balanced trees and let the pointers of an
element $b_{ij} = x_{(i)} + y_{(j)} \in \mathcal{B}$ point to $x_{(i)} \in X$ and $y_{(j)} \in Y$, respectively. The first
and second pointer of an element $x_{(i)} \in X$ point to the smallest and greatest element
$b_{ij}$ such that $b_{ij} \in \mathcal{B}$ for $1 \le j \le n$. The pointers for an element $y_{(j)} \in Y$ are defined
analogously.

Insertion and deletion of data points into the buffer $\mathcal{B}$ correspond to the insertion
and deletion of matrix rows or columns in $B$. We only consider insertions into and
deletions from $X$ in the following, because they are similar to insertions into and
deletions from $Y$.

Deletion of element $x_{del}$

1. Search in $X$ for $x_{del}$ and determine its rank $i$ and the elements $b_s$ and $b_g$ pointed
at.

2. Determine $y_{(j)}$ and $y_{(\ell)}$ with the help of the pointers such that $b_s = x_{(i)} + y_{(j)}$ and
$b_g = x_{(i)} + y_{(\ell)}$.

3. Find all elements $b_m = x_{(i)} + y_{(m)} \in \mathcal{B}$ with $j \le m \le \ell$.

4. Delete these elements $b_m$ from $\mathcal{B}$, delete $x_{del}$ from $X$, and update the pointers
accordingly.

5. Compute the new position of the $k$th element in $\mathcal{B}$.

Insertion of element $x_{ins}$

1. Determine the smallest element $b_s$ and the greatest element $b_g$ in $\mathcal{B}$.

2. Determine with a binary search the smallest $j$ such that $x_{ins} + y_{(j)} \ge b_s$ and the
greatest $\ell$ such that $x_{ins} + y_{(\ell)} \le b_g$.

3. Compute all elements $b_m = x_{ins} + y_{(m)}$ with $j \le m \le \ell$.

4. Insert these elements $b_m$ into $\mathcal{B}$, insert $x_{ins}$ into $X$ and update pointers to and
from the inserted elements accordingly.

5. Compute the new position of the $k$th element in $\mathcal{B}$.
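
Steps 2 and 3 of the insertion can be sketched with a binary search over the sorted values of Y. This simplified example uses a plain sorted Python list instead of the indexed AVL-trees and pointers described above, and the function name is hypothetical. It only determines which new matrix elements fall between the smallest and greatest buffer element.

```python
from bisect import bisect_left, bisect_right


def insertion_candidates(x_ins, ys_sorted, b_s, b_g):
    """New elements x_ins + y_(m) that lie within the buffer range [b_s, b_g].

    ys_sorted -- the elements of Y in ascending order
    b_s, b_g  -- smallest and greatest element currently in the buffer
    """
    # Smallest index j with x_ins + y_(j) >= b_s.
    j = bisect_left(ys_sorted, b_s - x_ins)
    # Greatest index l with x_ins + y_(l) <= b_g.
    l = bisect_right(ys_sorted, b_g - x_ins) - 1
    return [x_ins + y for y in ys_sorted[j : l + 1]]
```

The two binary searches cost O(log n), and the remaining work is proportional to the number of elements that are actually inserted into the buffer.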

It is easy to see that we need a maximum of $O(|\text{deleted elements}| \log n)$ and
$O(|\text{inserted elements}| \log n)$ time for deletion and insertion, respectively. After deletion
and insertion we determine the new position of the $k$th element in $\mathcal{B}$ and return
the new solution or recompute $\mathcal{B}$ with the offline algorithm if the $k$th element is not
in $\mathcal{B}$ any more. We may also introduce bounds on the size of $\mathcal{B}$ in order to maintain
linear size and to recompute $\mathcal{B}$ if these bounds are violated.

For the running time we have to consider the number of elements in the buffer
that depend on the inserted or deleted element and the amount the $k$th element may
move in the buffer.

Theorem 1. For a constant signal with stationary noise, the expected amortized time
per update is $O(\log n)$.

Proof. In a constant signal with stationary noise, data points are exchangeable in the
sense that the rank of each data point in the set of all data points is equiprobable.
Assume w.l.o.g. that we only insert into and delete from $X$. Consider for each rank $i$
of an element in $X$ the number of buffer elements depending on it, i.e. $|\{j \mid b_{ij} \in \mathcal{B}\}|$.

With $O(n)$ elements in $\mathcal{B}$ and equiprobable ranks of the observations inserted into or
deleted from $X$, the expected number of buffer elements depending on an observation
is $O(1)$. Thus, the expected number of buffer elements to delete or insert during an
update step is also $O(1)$ and the expected time we spend for the update is $O(\log n)$.

To calculate the amortized running time, we have to consider the number of times
$\mathcal{B}$ has to be recomputed. With equiprobable ranks, the expected amount the $k$th element
moves in the buffer for a deletion and a subsequent insertion is 0. Thus, the
expected number of times the buffer has to be recomputed is also 0 and consequently, the
expected amortized time per update is $O(\log n)$. □

2.2 Running time simulations 

To show the good performance of the algorithm in practice, we conducted some
running time simulations for online computation of the $Q_n$. The first data set for the
simulations suits the conditions of Theorem 1, i.e. it consists of a constant signal
with standard normal noise and an additional 10% outliers of size 8. The second data
set is the same in the first third of the time period, before an upward shift of size 8
and a linear upward trend in the second third and another downward shift of size 8
and a linear downward trend in the final third occur. The reason to look at this data
set is to analyze situations with shifts, trends and trend changes, because these are
not covered by Theorem 1.

Fig. 1. Insertions and deletions needed for an update with growing window size $n$.

Fig. 2. Positions of $\mathcal{B}$ in the matrix $B$ for data set 1 (left) and 2 (right).

We analyzed the average number of buffer insertions and deletions needed for an
update when performing $3n$ updates of windows of size $n$ with $10 \le n \le 500$. Recall
that the insertions and deletions directly determine the running time. A variable
number of updates assures similar conditions for all window widths. Additionally,
we analyzed the position of $\mathcal{B}$ over time visualized in the matrix $B$ when performing
3000 updates with a window of size 1000.

We see in Figure 1 that the number of buffer insertions and deletions for the first 
data set seems to be constant as expected, apart from a slight increase caused by the 
10% outliers. The second data set causes a stronger increase, but is still far from the 
theoretical worst case of 4n insertions and deletions. 

Considering Figure 2 we gain some insight into the observed number of update
steps. For the first data set, elements of $\mathcal{B}$ are restricted to a small region in the matrix
$B$. This region is recovered for the first third of the second data set in the right-hand
side figure. The trends in the second data set cause $\mathcal{B}$ to be in an additional, even
more concentrated diagonal region, which is even better for the algorithm. The cause
for the increased running time is the time it takes to adapt to trend changes. After a
trend change there is a short period, in which parts of $\mathcal{B}$ are situated in a wider region
of the matrix $B$.



3 Comparative study 

An important task in signal extraction is the fast and reliable detection of abrupt
level shifts. Comparison of two medians calculated from different windows has been
suggested for the detection of such edges in images (Bovik and Munson (1986),
Hwang and Haddad (1994)). This approach has also been found to give good results
in signal processing (Fried (2007)). As with the two-sample t-test, an estimate
of the noise variance is needed for standardization. Robust scale estimators like the
$Q_n$ can be applied for this task. Assuming that the noise variance can vary over time
but is locally constant within each window, we calculate both the median and the $Q_n$
separately from two time windows $y_{t-h+1}, \ldots, y_t$ and $y_{t+1}, \ldots, y_{t+k}$ for the
detection of a level shift between times $t$ and $t+1$. Let $\tilde{\mu}_{t-}$ and $\tilde{\mu}_{t+}$ be the
medians from the two time windows, and $\hat{\sigma}_{t-}$ and $\hat{\sigma}_{t+}$ be the scale estimates for
the left and the right window of possibly different widths $h$ and $k$. An asymptotically
standard normal test statistic in case of a (locally) constant signal and Gaussian noise
with a constant variance is

$$\frac{\tilde{\mu}_{t+} - \tilde{\mu}_{t-}}{\sqrt{0.5\,\pi\,(\hat{\sigma}_{t-}^2/h + \hat{\sigma}_{t+}^2/k)}}$$

Critical values for small sample sizes can be derived by simulation. 
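
The test statistic above can be illustrated with a minimal sketch. It assumes a naive $O(n^2)$ computation of the $Q_n$ (not the update algorithm proposed in this paper), the Gaussian consistency constant 2.2219, and no finite-sample correction factors. The function names `qn_scale` and `shift_statistic` are hypothetical.

```python
import math
import numpy as np

def qn_scale(x):
    """Naive O(n^2) Q_n: the k-th smallest pairwise distance times 2.2219."""
    x = np.asarray(x, dtype=float)
    n = x.size
    h = n // 2 + 1                 # roughly half of the observations
    k = h * (h - 1) // 2           # order statistic used by the Q_n
    i, j = np.triu_indices(n, k=1) # all index pairs with i < j
    diffs = np.sort(np.abs(x[i] - x[j]))
    return 2.2219 * diffs[k - 1]

def shift_statistic(left, right):
    """Standardized difference of the window medians (right minus left)."""
    h, k = len(left), len(right)
    num = np.median(right) - np.median(left)
    var = 0.5 * math.pi * (qn_scale(left) ** 2 / h + qn_scale(right) ** 2 / k)
    return num / math.sqrt(var)
```

Large absolute values of `shift_statistic` would indicate a level shift between the two windows; for short windows the critical values would have to come from simulation, as stated above.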

Figure 3 compares the efficiencies of the $Q_n$, the median absolute deviation about
the median (MAD) and the interquartile range (IQR) measured as the percentage
variance of the empirical standard deviation as a function of the sample size n, derived
from 200000 simulation runs for each n. Obviously, the $Q_n$ is much more efficient
than the other, 'classical' robust scale estimators.

The higher efficiency of the $Q_n$ is an intuitive explanation for median comparisons
standardized by the $Q_n$ having higher power than those standardized by the
MAD or the IQR if the windows are not very short. The power functions depicted in
Figure 3 for the case h = k = 15 have been derived from shifts of several heights
δ = 0, 1, ..., 6 overlaid by standard Gaussian noise, using 10000 simulation runs
each. The two-sample t-test, which is included for comparison, of course offers
higher power than all the median comparisons under Gaussian assumptions.
However, Figure 3 shows that its power can drop down to zero because of
a single outlier, even if the shift is huge. To see this, a shift of fixed size 10σ was
generated, and a single outlier of increasing size in the opposite direction of the
shift was inserted briefly after the shift. The median comparisons are not affected by a
single outlier even if windows as short as h = k = 7 are used.








Fig. 3. Gaussian efficiencies (top left), power of shift detection (top right), power for a
10σ-shift in case of an outlier of increasing size (bottom left; horizontal axis: outlier size),
and detection rate in case of an increasing number of deviating observations (bottom right;
horizontal axis: number of deviating observations): $Q_n$ (solid), MAD (dashed), IQR
(dotted), and (dashed-dot). The two-sample t-test (thin solid) is included for comparison.



As a final exercise, we treat shift detection in case of an increasing number of
deviating observations in the right-hand window. Since a few outliers should neither
mask a shift nor cause false detection when the signal is constant, we would like
a test to resist the deviating observations until more than half of the observations
are shifted, and to detect a shift from then on. Figure 3 shows the detection rates
calculated as the percentage of cases in which a shift was detected for h = k = 7.
Median comparisons with the $Q_n$ behave as desired, while a few outliers can mask
a shift when using the IQR for standardization, as with the t-test. This can be
explained by the IQR having a smaller breakdown point than the $Q_n$ and the MAD.



4 Conclusions 

The proposed new update algorithm for calculation of the $Q_n$ scale estimator or the
Hodges-Lehmann location estimator in a moving time window shows good running
time behavior in different data situations. The real-time application of these
estimators, which are both robust and quite efficient, is thus rendered possible. This is
interesting for practice since the comparative studies reported here show that the
good efficiency of the $Q_n$, for instance, improves edge detection as compared to other
robust estimators.



Abstract. A general harmonic model for pitch tracking of polyphonic musical time series will
be introduced. Based on a model of Davy and Godsill (2002) the fundamental frequencies
of polyphonic sound are estimated simultaneously. To improve these results a
preprocessing step was implemented to build an extended polyphonic model.

All methods are applied to real audio data from the McGill University Master Samples
(Opolko and Wapnick (1987)).



1 Introduction 

The automatic transcription of musical time series data is a wide research domain. 
There are many methods for the pitch tracking of monophonic sound (e.g. Weihs and 
Ligges (2006)). More difficult is the distinction of polyphonic sound because of the 
properties of the time series of musical sound. 

In this research paper we describe a general harmonic model for polyphonic musical
time series data, based on a model of Davy and Godsill (2002). After transforming
this model into a hierarchical Bayes model, the fundamental frequencies of
these data can be estimated with MCMC methods.

Then we consider a preprocessing step to improve the results. For this, we intro- 
duce the design of an alphabet of artificial tones. 

After that we apply the polyphonic model to real audio data from the McGill Uni- 
versity Master Samples (Opolko and Wapnick (1987)). We demonstrate the building 
of an alphabet on real audio data and present the results of utilising such an alphabet. 
Further, we show first results of combining the preprocessing step and the MCMC 
methods. Finally the results are discussed and an outlook to future work is given. 



2 Polyphonic model 

In this section the harmonic polyphonic model will be introduced and its components
will be illustrated. The model is based on the model of Davy and Godsill (2002) and
has the following structure:

$$y_t = \sum_{k=1}^{K} \sum_{h=1}^{H_k} \sum_{i=0}^{I} \phi_{t,i} \left[ a_{k,h,i} \cos(2\pi h f_k t / f_s) + b_{k,h,i} \sin(2\pi h f_k t / f_s) \right] + \varepsilon_t$$

Fig. 1. Illustration of the modelling with basis functions. Modelling time-variant amplitudes
of a real audio signal

The number of observations of the audio signal is $T$, $t \in \{0, \ldots, T-1\}$. Each signal
is normalized to [−1, 1] since the absolute overall loudness of different recordings is
not relevant. The signal $y_t$ is made up of $K$ tones, each composed of harmonics
from $H_k$ partial tones. In this paper the number of tones $K$ is assumed to be known.
The first partial of the $k$-th tone is the fundamental frequency $f_k$, the other $H_k - 1$
partials are called overtones. Further, $f_s$ is the sampling rate.

To reduce the number of parameters to be estimated, the amplitudes $a_{k,h,t}$ and
$b_{k,h,t}$ of the $k$-th tone and the $h$-th partial tone at each timepoint $t$ are modelled with
$I + 1$ basis functions. The basis functions $\phi_{t,i}$ are equally spaced Hanning windows
with 50% overlap:

$$\phi_{t,i} := \cos^2\left[\pi (t - i\Delta)/(2\Delta)\right] \, \mathbb{1}_{[(i-1)\Delta,\,(i+1)\Delta]}(t), \qquad \Delta = (T-1)/I.$$

So the $a_{k,h,i}$ and $b_{k,h,i}$ are the amplitudes of the $k$-th tone, the $h$-th partial tone and the
$i$-th basis function. Finally, $\varepsilon_t$ is the model error.
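
The following minimal sketch evaluates the Hanning basis functions and synthesizes a noise-free signal from given amplitudes. It assumes, for simplicity, the same number of partials for every tone and the usual cosine/sine parameterization of the harmonics. The names `hanning_basis` and `synthesize` are hypothetical.

```python
import numpy as np

def hanning_basis(T, I):
    """Return a (T, I+1) matrix of cos^2 windows centred at i*delta."""
    delta = (T - 1) / I
    t = np.arange(T)[:, None]
    centers = delta * np.arange(I + 1)[None, :]
    u = t - centers
    phi = np.cos(np.pi * u / (2 * delta)) ** 2
    phi[np.abs(u) > delta] = 0.0   # indicator of [(i-1)delta, (i+1)delta]
    return phi

def synthesize(f, a, b, fs, T):
    """Noise-free signal; a and b have shape (K, H, I+1), f has shape (K,)."""
    K, H, n_basis = a.shape
    phi = hanning_basis(T, n_basis - 1)
    t = np.arange(T)
    y = np.zeros(T)
    for k in range(K):
        for h in range(1, H + 1):
            arg = 2 * np.pi * h * f[k] / fs * t
            # time-variant amplitudes are linear combinations of the windows
            y += (phi @ a[k, h - 1]) * np.cos(arg) + (phi @ b[k, h - 1]) * np.sin(arg)
    return y
```

Because neighbouring windows overlap by 50%, the basis functions sum to one at every time point, so constant coefficients reproduce a constant amplitude.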

Figure 1 shows the necessity of using basis functions and thus modelling time-variant
amplitudes. In the figure the points are the observations of the real signal. The
assumption of constant amplitudes over time cannot depict the higher amplitudes at
the beginning of the tone (black line). Modelling with time-variant amplitudes (grey
line) leads to better results.

The model can be written as a hierarchical Bayes model. The estimation of the
parameters results from a stochastic search for the best coefficients in a given region with
different prior distributions. The region and the probabilities are specified by
distributions. This leads to the implementation of MCMC methods (Gilks et al. (1996)).








For the sampling of the fundamental frequency $f_k$, variants of the
Metropolis-Hastings algorithm are used where the candidate frequencies are generated in
different ways.

In the first variant the candidate for the fundamental frequency is sampled from
a uniform distribution in the range of the possible frequencies. In the second variant
the new candidate for the fundamental frequency is half or double the current
fundamental frequency. In the third variant a random walk is used which
allows small changes of the fundamental frequency to get a more precise result.

For the determination of the number of partial tones a reversible jump MCMC
was implemented. In each iteration of the MCMC computation one of these
algorithms is chosen with a distinct probability.

The amplitude parameters $a_{k,h,i}$ and $b_{k,h,i}$ are computed conditional on the
fundamental frequency $f_k$ and the number of partial tones $H_k$.

There is no full generation of the posterior distributions due to the computational 
burden. Instead we use a stopping criterion to stop the iterations if the slope of the 
model error is no longer significant (Sommer and Weihs (2006)). 
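
The three ways of generating candidate frequencies can be sketched as follows. This is only an illustration of the proposal step, not the sampler used in the paper: the selection probabilities, the random-walk step size and the function name `propose_frequency` are assumptions, and the acceptance step of the Metropolis-Hastings algorithm is omitted.

```python
import random

def propose_frequency(f_current, f_min, f_max, rng, probs=(0.2, 0.2, 0.6), step=1.0):
    """Draw a candidate fundamental frequency; None if it leaves the range."""
    u = rng.random()
    if u < probs[0]:
        # variant 1: uniform draw over the admissible frequency range
        cand = rng.uniform(f_min, f_max)
    elif u < probs[0] + probs[1]:
        # variant 2: jump to half or double the current frequency
        cand = f_current * (0.5 if rng.random() < 0.5 else 2.0)
    else:
        # variant 3: random walk for small local changes
        cand = f_current + rng.gauss(0.0, step)
    return cand if f_min <= cand <= f_max else None
```

A call such as `propose_frequency(262.0, 100.0, 1100.0, random.Random(1))` returns one candidate; a candidate of `None` would simply be rejected.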



3 Extended polyphonic model 

An extended polyphonic model with an additional preprocessing step before the
MCMC algorithms will be established in this section. The results of this step can serve
as the starting values for the MCMC algorithm in order to improve the results.

For this purpose we constructed an alphabet of artificial tones. These artificial
tones are compared with the audio data to be analysed. The artificial tones are
composed by evaluating the periodograms of seven time intervals with 512 observations
of a real audio signal with 50% overlap. So a time interval of 2048 observations is
regarded. At a sampling rate of 11 025 Hz a time interval of 0.186 seconds is observed.

These seven periodograms are averaged to a mean periodogram. For better com- 
parability all values in this periodogram are set to zero which are smaller than one 
percent of the maximum peak. All artificial tones together form the alphabet. 
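
The construction of one artificial tone can be sketched as below. The periodogram scaling (squared FFT magnitude divided by the segment length) and the function name `mean_periodogram` are assumptions; the segment count, segment length, overlap and one-percent cut-off follow the description above.

```python
import numpy as np

def mean_periodogram(x, n_seg=7, seg_len=512, threshold=0.01):
    """Average the periodograms of overlapping segments and cut small values."""
    x = np.asarray(x, dtype=float)
    hop = seg_len // 2                       # 50% overlap
    needed = seg_len + (n_seg - 1) * hop     # 2048 observations by default
    if x.size < needed:
        raise ValueError("signal too short for the requested segments")
    pgs = [
        np.abs(np.fft.rfft(x[s * hop : s * hop + seg_len])) ** 2 / seg_len
        for s in range(n_seg)
    ]
    mean_pg = np.mean(pgs, axis=0)
    # set everything below one percent of the maximum peak to zero
    mean_pg[mean_pg < threshold * mean_pg.max()] = 0.0
    return mean_pg
```

The frequencies belonging to the returned values can be obtained with `np.fft.rfftfreq(512, 1 / 11025)`.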

In figure 2 (upper part) a periodogram of a c4 (262 Hz) played by an electric 
guitar can be seen. The lower part of figure 2 shows the small values of the peri- 
odogram. The horizontal line reflects the value of one percent of the maximum value 
of the periodogram. All values below this line are set to zero in the alphabet. 

To determine the correct notes, every combination of two artificial tones of the
alphabet is matched to the periodogram of the real audio signal. The modified
periodograms of the two artificial tones are summed up to one periodogram. These
summed periodograms are compared with the periodogram of the audio signal. The
notes corresponding to the two artificial tones which cause minimal error are
considered as estimates for the true notes. Finally, voting over ten time intervals
leads to the estimation of the fundamental frequencies.
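
The matching step can be sketched as an exhaustive search over all pairs of alphabet entries. The error measure is not specified in the text; the sum of squared differences used here is an assumption, as is the function name `best_pair`. Pairs of identical tones are included in the search.

```python
from itertools import combinations_with_replacement
import numpy as np

def best_pair(alphabet, target):
    """Return (name_a, name_b, error) of the pair whose summed periodogram fits best.

    alphabet maps a tone name to its thresholded mean periodogram.
    """
    best = None
    for (name_a, pg_a), (name_b, pg_b) in combinations_with_replacement(alphabet.items(), 2):
        err = float(np.sum((pg_a + pg_b - target) ** 2))  # assumed error measure
        if best is None or err < best[0]:
            best = (err, name_a, name_b)
    return best[1], best[2], best[0]
```

For an alphabet with 150 entries this enumeration visits 150 · 151 / 2 = 11 325 pairs.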

Fig. 2. Periodogram of note c4 played with an electric guitar. Original (upper part) and zoomed
in with cut-off line (lower part)



4 Results 

In this section results of estimating the fundamental frequencies of real audio data
will be presented. First, the data used in our studies will be introduced. Then first
results are shown. Further, the construction of an alphabet will be reconsidered and
then the results based on this alphabet are depicted. Finally, additional results are
shown.

4.1 Data 

The data used for our monophonic and polyphonic studies are real audio data from
the McGill University Master Samples (Opolko and Wapnick (1987)). We chose 5
instruments (electric guitar, piano, violin, flute and trumpet) each with 5 notes (262,
330, 390, 440 and 523 Hz) out of two groups of instruments, string instruments and
wind instruments. The three string instruments are played in different ways, namely
picked, struck and bowed. The two wind instruments are a woodwind instrument and
a brass instrument.

Table 1. 1 if both notes were correctly identified, 0 otherwise. The left-hand table requires the
exact note to be estimated, the right-hand table also counts octaves of the note as correct.

                  instrument                                  instrument
notes    flu  guit  pian  trum  viol        notes    flu  guit  pian  trum  viol
c4-c4     1     1     1     1     1         c4-c4     1     1     1     1     1
c4-e4     0     1     0     0     1         c4-e4     0     1     0     1     1
c4-g4     0     0     0     0     0         c4-g4     1     1     1     1     1
c4-a4     1     1     1     0     0         c4-a4     1     1     1     0     0
c4-c5     1     1     1     1     1         c4-c5     1     1     1     1     1

For polyphonic data we superimposed the oscillations of two tones. The first
tone was a c4 (262 Hz) played by the piano. This tone was combined with each
instrument-tone combination we used. So we had 25 datasets each normalized to
[−1, 1]. The pitches of the tones were tracked over ten time intervals of T = 512
observations with 50% overlap at a sampling rate of 11 025 Hz. The number of
observations in one time interval is a tradeoff between the computational burden and
the quality of the estimate. The estimate of the notes is the result of voting over the
ten time intervals. The estimated notes are the two notes which occur in the ten time
intervals most often.
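
The voting step can be sketched as a simple frequency count over the per-interval estimates. The function name `vote_two_notes` is hypothetical, and the sketch does not treat the case in which both true notes are identical (for example c4-c4) separately.

```python
from collections import Counter

def vote_two_notes(interval_estimates):
    """Return the two notes that occur most often over all time intervals.

    interval_estimates is a list of (note, note) pairs, one pair per interval.
    """
    counts = Counter(note for pair in interval_estimates for note in pair)
    return [note for note, _ in counts.most_common(2)]
```

With ten intervals, `interval_estimates` would contain ten pairs, one for each window of 512 observations.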

4.2 First results with polyphonic model 

The first step in our analysis was to consider how well the model works and whether
the pitch of a tone is estimated exactly. For this purpose we made a first study with
monophonic data. The results of the study with monophonic time series data were
very promising. In most cases the correct note was estimated and the deviations from
the correct fundamental frequencies were minor (Sommer and Weihs (2006)).

The results of the estimation of polyphonic time series data are not as promising
as the results with monophonic time series data. There are many notes which are not
estimated correctly. The left side of Table 1 shows 1 if both notes were estimated
correctly and 0 otherwise. In 15 of the 25 experiments both notes were estimated
correctly. Counting octaves of the notes as correct increases the number of correct
estimates to 21 (see the right-hand side of Table 1). It can be seen that all note pairs of
the combination c4-g4 are estimated incorrectly, but they all count as correct when
octaves of the right notes are accepted (Sommer and Weihs (2007)).

Analysing the data over 20, 30 and 50 time intervals yields the same outcomes.
So it seems to be adequate to examine 10 time intervals. In longer interval series new
correctly estimated notes could not be determined.

Table 2. 1 if both notes AND instruments are correctly recognized after voting and 1* if both
notes are estimated correctly, but not the instrument (left), including octaves of the correct
notes (right). In 22 (left) and 23 (right) cases both notes are estimated correctly, in 18 cases
for both tones the correct instrument is recognized.

                  instrument                                  instrument
notes    flu  guit  pian  trum  viol        notes    flu  guit  pian  trum  viol
c4-c4     0     0     1*    1*    1         c4-c4     0     0     1*    1*    1
c4-e4     1     1     1     1     1*        c4-e4     1     1     1     1     1*
c4-g4     1     1     1     1     1         c4-g4     1     1     1     1     1
c4-a4     1     1     1     1     1         c4-a4     1     1     1     1     1
c4-c5     1     1     0     1*    1         c4-c5     1     1     1*    1*    1

4.3 Results with extended polyphonic model 

In a first study with an alphabet of artificial tones we used 30 notes from g3 (196 Hz) 
to c6 (1 047 Hz) of the same five instruments as for the studies in section 4.2. The 
choice of this range is restricted by the availability of the data of the McGill Univer- 
sity Master Samples. The mean periodogram is computed out of seven periodograms 
each with T = 512 observations with 50% overlap at a sampling rate of 11 025 Hz. 
The first 1000 observations of a note were not considered for this periodogram in 
order to omit the attack of an instrument. Overall there are 150 artificial notes in the 
alphabet. 

With this alphabet 11 325 pairwise comparisons of two artificial tones with the
audio signal have to be computed. The results of the estimates of the same 25 note
combinations used in the previous study can be seen in Table 2. The left-hand side
of the table shows that in 22 of 25 cases the fundamental frequency of both notes is
estimated correctly. If octaves of the correct notes are counted as correct this number
increases to 23 (right-hand side of Table 2).

Further, the entries in Table 2 are annotated with a star if the instruments are not
recognized correctly. This means that in only 18 of 22 cases (18 of 23) the instruments
of both notes are identified correctly. Moreover, it can be seen that the cases
where the notes are estimated incorrectly occur only in the first and last rows of the
tables. So the correct estimation of the notes seems to be a problem if both notes are
the same or one is the octave of the other.

4.4 Further results 

Using these estimated notes as starting values for the MCMC algorithm in order to
estimate the fundamental frequencies more precisely does not lead to an improvement
over the results of the preprocessing. On the contrary, the results are comparable to
the results without this preprocessing step. In most of the cases the estimated notes
are the octave of the correct notes. Often, the MCMC algorithm leads to an
estimate H = 1 of the number of partial tones. This often meant that only the octave of
the fundamental frequency is found and neither the fundamental itself nor any other
overtones.

A solution to this problem is the limitation of the possible range of frequencies.
Restricting the frequency to the same range which the alphabet covers and forcing
the number of partial tones to be greater than 1 yields 20 correct estimations of both
notes, and 24 if octaves are counted as correct. A further improvement can be achieved
by applying two chains in the MCMC algorithm. Starting values for both chains are
equal, namely the results of the preprocessing. For each time interval the chain with
the minimal model error is chosen. Voting over the ten time intervals results in 22
correct estimates, and 25 if octaves are counted as correct. There are no more incorrectly
estimated notes; in the worst case octaves of the correct notes are found. Also, voting
is based on many more correct notes in the individual time intervals of 512
observations than in our previous studies, i.e. now typically five or six estimates are
correct in contrast to three before.



5 Conclusion 

In this paper a pitch tracking model for polyphonic musical time series data has 
been introduced. The unknown parameters are estimated with an MCMC algorithm 
as a stochastic optimization procedure. Because of the unfavorable results in a first 
study with polyphonic data the polyphonic model was extended and a preprocessing 
step was implemented. The application of an alphabet of artificial notes leads to 
promising results. The combination of the preprocessing and the MCMC algorithm 
is even more encouraging after the limitation of the frequency range. 

Further work will extend the alphabet by using more artificial tones and considering
attack, sustain and release, the different phases of a realisation of a note. An
additional aim is the construction of a complete alphabet on the whole audio data of
the McGill University Master Samples.



Abstract. In this paper, we present an algorithm to identify types of places and objects from
2D and 3D laser range data obtained in indoor environments. Our approach is a combination
of a collective classification method based on associative Markov networks together with an
instance-based feature extraction using nearest neighbor. Additionally, we show how to select
the best features needed to represent the objects and places, reducing the time needed for the
learning and inference steps while maintaining high classification rates. Experimental results
on real data demonstrate the effectiveness of our approach in indoor environments.



1 Introduction 

One key application in mobile robotics is the creation of geometric maps using data 
gathered with range sensors in indoor environments. These maps are usually used for 
navigation and represent free and occupied spaces. However, whenever the robots 
are designed to interact with humans, it seems necessary to extend these representa- 
tions of the environment to improve the human-robot communication. In this work, 
we present an approach to extend indoor laser-based maps with semantic terms like 
“corridor”, “room”, “chair”, “table”, etc, used to annotate different places and ob- 
jects in 2D or 3D maps. We introduce the instance-based associative Markov net- 
work (iAMN), which is an extension of associative Markov networks together with 
instance-based nearest neighbor methods. The approach follows the concept of col- 
lective classification in the sense that the labeling of a data point in the space is partly 
influenced by the labeling of its neighboring points. iAMNs classify the points in a 
map using a set of features representing these points. In this work, we show how to 
choose these features in the different cases of 2D and 3D laser scans. Experimental 
results obtained in simulation and with real robots demonstrate the effectiveness of 
our approach in various indoor environments. 



2 Related work

Several authors have considered the problem of adding semantic information to 2D 
maps. Koenig and Simmons (1998) apply a pre-programmed routine to detect door- 
ways. Althaus and Christensen (2003) use sonar data to detect corridors and door- 
ways. Moreover, Friedman et al. (2007) introduce Voronoi random fields as a tech- 
nique for mapping the topological structure of indoor environments. Finally, Mar- 
tinez Mozos et al. (2005) use AdaBoost to create a semantic classifier to classify free 
cells in occupancy maps. 

Also the problem of recognizing objects from 3D data has been studied inten- 
sively. Osada et al. (2001) propose a 3D object recognition technique based on shape 
distributions. Additionally, Huber et al. (2004) present an approach for parts-based 
object recognition. Boykov and Huttenlocher (1999) propose an object recognition 
method based on Markov random fields. Finally, Anguelov et al. (2005) present an 
associative Markov network approach to classify 3D range data. This paper is based 
on our previous work (Triebel et al. (2007)) which introduces the instance-based 
associative Markov networks. 



3 Collective classification 

In most standard spatial classification methods, the label of a data point only depends 
on its local features but not on the labeling of nearby data points. However, in practice 
one often observes a statistical dependence of the labeling associated to neighboring 
data points. Methods that use the information of the neighborhood are denoted as 
collective classification techniques. In this work, we use a collective classifier based 
on associative Markov networks (AMNs) (Taskar et al. (2004)), which is improved 
with an instance-based nearest-neighbor (NN) approach. 

3.1 Associative Markov networks 

An associative Markov network is an undirected graph in which the nodes are
represented by N random variables $y_1, \ldots, y_N$. In our case, these random variables are
discrete and correspond to the semantic label of each of the data points $p_1, \ldots, p_N$,
each represented by a vector $x_i \in \mathbb{R}^L$ of local features. Additionally, each edge has an
associated vector $x_{ij}$ of features representing the relationship between the corresponding
nodes. Each node $y_i$ has an associated non-negative potential $\varphi(x_i, y_i)$. Similarly,
each edge $(y_i, y_j)$ has a non-negative potential $\psi(x_{ij}, y_i, y_j)$ assigned to it. The node
potentials reflect the fact that for a given feature vector $x_i$ some labels are more likely
to be assigned to $p_i$ than others, whereas the edge potentials encode the interactions
of the labels of neighboring nodes given the edge features $x_{ij}$. Whenever the potential
of a node or edge is high for a given label $y_i$ or a label pair $(y_i, y_j)$, the conditional
probability of these labels given the features is also high. The conditional probability
that is represented by the network is expressed as:

$$P(y \mid x) = \frac{1}{Z} \prod_{i=1}^{N} \varphi(x_i, y_i) \prod_{(ij) \in E} \psi(x_{ij}, y_i, y_j), \qquad (1)$$

where the partition function is
$Z = \sum_{y'} \prod_{i=1}^{N} \varphi(x_i, y'_i) \prod_{(ij) \in E} \psi(x_{ij}, y'_i, y'_j)$.

The potentials can be defined using the log-linear model proposed by Taskar
et al. (2004). However, we use a modification of this model in which a weight vector
$w_n^k$ is introduced for each class label $k = 1, \ldots, K$. Additionally, a different
weight vector $w_e^{k,l}$, with $k = y_i$ and $l = y_j$, is assigned to each edge. The potentials are
then defined as:

$$\log \varphi(x_i, y_i) = \sum_{k=1}^{K} (w_n^k \cdot x_i)\, y_i^k, \qquad (2)$$

$$\log \psi(x_{ij}, y_i, y_j) = \sum_{k=1}^{K} \sum_{l=1}^{K} (w_e^{k,l} \cdot x_{ij})\, y_i^k y_j^l, \qquad (3)$$

where $y_i^k$ is an indicator variable which is 1 if point $p_i$ has label $k$ and 0 otherwise.
In a further refinement step in our model, we introduce the constraints $w_e^{k,l} = 0$ for
$k \neq l$ and $w_e^{k,k} \geq 0$. This results in $\psi(x_{ij}, k, l) = 1$ for $k \neq l$ and
$\psi(x_{ij}, k, k) = \lambda_{ij}^k$, where $\lambda_{ij}^k \geq 1$. The idea here is that edges between nodes with
different labels are penalized over edges between equally labeled nodes.

If we reformulate Equation (1) as the conditional probability $P_w(y \mid x)$, where the
parameters are expressed by the weight vectors $w = (w_n, w_e)$, and plug in
Equations (2) and (3), we obtain that $\log P_w(y \mid x)$ equals

$$\sum_{i=1}^{N} \sum_{k=1}^{K} (w_n^k \cdot x_i)\, y_i^k + \sum_{(ij) \in E} \sum_{k=1}^{K} (w_e^{k,k} \cdot x_{ij})\, y_i^k y_j^k - \log Z_w(x). \qquad (4)$$

In the learning step we try to maximize $P_w(y \mid x)$ by maximizing the margin
between the optimal labeling $\hat{y}$ and any other labeling $y$ (Taskar et al. (2004)). This
margin is defined by:

$$\log P_w(\hat{y} \mid x) - \log P_w(y \mid x). \qquad (5)$$

Inference on the unlabeled data points is done by finding the labels $y$ that
maximize $\log P_w(y \mid x)$. We refer to Triebel et al. (2007) for more details.
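
To make Equations (1) to (4) concrete, the following minimal sketch evaluates the conditional probability of every labeling of a tiny network by brute-force enumeration of the partition function. This is only feasible for a handful of nodes and is not the learning or inference procedure used in the paper. The function names and the data layout (`w_n[k]` for node weights, `w_e[k]` for the weights $w_e^{k,k}$) are assumptions.

```python
import itertools
import numpy as np

def log_score(y, x_nodes, x_edges, edges, w_n, w_e):
    """Unnormalized log-probability of labeling y (Equation (4) without log Z)."""
    s = sum(float(np.dot(w_n[y[i]], x_nodes[i])) for i in range(len(y)))
    for (i, j), x_ij in zip(edges, x_edges):
        if y[i] == y[j]:               # only equal labels contribute (w_e^{k,l} = 0 for k != l)
            s += float(np.dot(w_e[y[i]], x_ij))
    return s

def conditional_probabilities(x_nodes, x_edges, edges, w_n, w_e):
    """Return a dict mapping every labeling to P(y | x)."""
    n_nodes, n_classes = len(x_nodes), len(w_n)
    labelings = list(itertools.product(range(n_classes), repeat=n_nodes))
    scores = np.array([log_score(y, x_nodes, x_edges, edges, w_n, w_e) for y in labelings])
    scores -= scores.max()             # numerical stability before exponentiation
    p = np.exp(scores)
    p /= p.sum()                       # division by the partition function Z
    return dict(zip(labelings, p))
```

With non-negative edge weights and edge features, labelings in which neighboring nodes share a label receive a higher probability, which is the associative behavior described above.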

3.2 Instance-based AMNs 

The main drawback of the AMN classifier explained previously, which is based on
the log-linear model, is that it separates the classes linearly. This assumes that the
features are separable by hyperplanes, which is not justified in all applications. This
restriction does not hold for instance-based classifiers such as the nearest-neighbor (NN)
classifier, in which a query data point $\tilde{p}$ is assigned the label that corresponds to the
training data point $p$ whose features $x$ are closest to the features $\tilde{x}$ of $\tilde{p}$. In the
learning step, the NN classifier simply stores the entire training data set and does not
compute a reduced set of training parameters.

To combine the advantage of instance-based NN classification with the AMN
approach, we convert the feature vector $\tilde{x}$ of the query point $\tilde{p}$ using the transform
$\tau(\tilde{x}) = (d(\tilde{x}, \hat{x}_1), \ldots, d(\tilde{x}, \hat{x}_K))$, where $K$ is the number of classes and
$\hat{x}_k$ denotes the training example with label $k$ closest to $\tilde{x}$. The transformed features
are more easily separable by hyperplanes. Additionally, the N nearest neighbors can
be used in the transform function.
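
The instance-based transform can be sketched as follows, assuming the Euclidean distance for $d$ (the distance is not specified here) and a single nearest neighbor per class. The function name `nn_transform` is hypothetical.

```python
import numpy as np

def nn_transform(x, train_x, train_labels, num_classes):
    """Map x to the vector of distances to the closest training example of each class."""
    x = np.asarray(x, dtype=float)
    train_x = np.asarray(train_x, dtype=float)
    train_labels = np.asarray(train_labels)
    out = np.empty(num_classes)
    for k in range(num_classes):
        # distances from x to all training examples carrying label k
        d = np.linalg.norm(train_x[train_labels == k] - x, axis=1)
        out[k] = d.min()
    return out
```

The output has one entry per class, so the node features passed to the AMN have dimension K regardless of the original feature dimension.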



4 Feature extraction in 2D maps 

In this paper, indoor environments are represented by two dimensional occupancy 
grid maps (Moravec (1988)). The unoccupied cells of a grid map form an 8-connected 
graph which is used as the input to the iAMN. Each cell is represented by a set of 
single-valued geometrical features calculated from the 360° laser scan in that partic- 
ular cell as shown by Martinez Mozos et al. (2005). 

Three-dimensional scenes are represented by point clouds which are extracted with
a laser scan. For each 3D point we computed spin images (Johnson (1997)) with a
size of 5 × 10 bins. The spherical neighborhood for computing the spin images had
a radius between 10 and 15 cm, depending on the resolution of the input data.



5 Feature selection 

One of the problems when classifying points represented by range data consists in
selecting the size L of the feature vectors x. The number of possible features that
can be used to represent each data point is usually very large and can easily be in
the order of hundreds. This problem is known as the curse of dimensionality. There are
at least two reasons to try to reduce the size of the feature vector. The most obvious
one is the computational complexity, which in our case is also the more critical one:
we have to perform learning and inference in networks with thousands of nodes.
Another reason is that although some features may carry good classification
information when treated separately, there may be little gain if they are combined,
because they have a high mutual correlation (Theodoridis and Koutroumbas (2006)).

In our approach, the size of the feature vector for 2D data points is of the order 
of hundreds. The idea is to reduce the size of the feature vectors when used with the 
iAMN and at the same time try to maintain their class discriminatory information. To 
do this we apply a scalar feature selection procedure which uses a class separability 
criterion and incorporates correlation information. As separability criterion C, we use
Fisher's discriminant ratio (FDR) extended to the multi-class case (Theodoridis
and Koutroumbas (2006)). For a scalar feature $f$ and $K$ classes $\{w_1, \ldots, w_K\}$, $C(f)$
can be defined as:

$$C(f) = \mathrm{FDR}_f = \sum_{i=1}^{K} \sum_{j \neq i}^{K} \frac{(\mu_i - \mu_j)^2}{\sigma_i^2 + \sigma_j^2}, \qquad (6)$$

where the subscripts $i, j$ refer to the mean and variance of the classes $w_i$ and $w_j$
respectively. Additionally, the cross-correlation coefficient between any two features
$f$ and $g$ given $T$ training examples is defined as:

$$\rho_{fg} = \frac{\sum_{t=1}^{T} x_{tf}\, x_{tg}}{\sqrt{\sum_{t=1}^{T} x_{tf}^2 \, \sum_{t=1}^{T} x_{tg}^2}}, \qquad (7)$$

where $x_{tf}$ denotes the value of the feature $f$ in the training example $t$. Finally, the
selection of the best $L$ features involves the following steps:

• Select the first feature $f_1$ as $f_1 = \arg\max_f C(f)$.

• Select the second feature $f_2$ as:

$$f_2 = \arg\max_{f \neq f_1} \left\{ \alpha_1 C(f) - \alpha_2 |\rho_{f_1 f}| \right\},$$

where $\alpha_1$ and $\alpha_2$ are weighting factors.

• Select $f_l$, $l = 3, \ldots, L$, such that:

$$f_l = \arg\max_{f \neq f_r} \left\{ \alpha_1 C(f) - \frac{\alpha_2}{l-1} \sum_{r=1}^{l-1} |\rho_{f_r f}| \right\}, \qquad r = 1, 2, \ldots, l-1.$$


6 Experiments 

The approach described above has been implemented and tested on several 2D maps
and 3D scenes. The goal of the experiments is to show the effectiveness of the iAMN
on different indoor range data.

6.1 Classification of places in 2D maps

This experiment was carried out using the occupancy grid map of building 79 at
the University of Freiburg. For efficiency reasons we used a grid resolution of 20 cm,
which led to a graph of 8088 nodes. The map was divided into two parts, the left
one used for learning, and the right one used for classification purposes (Figure 1).
For each cell we calculated 203 geometrical features. This number was reduced to 30
by applying the feature selection of Section 5. The right image of Figure 1 shows the
resulting classification with a success rate of 97.6%.

Fig. 1. The left image depicts the training map of building 79 at the University of Freiburg.
The right image shows the resulting classified map using an iAMN with 30 selected features.



6.2 Classification of objects in 3D scenes 

In this experiment we classify 3D scans of objects that appear in a laboratory of
building 79 of the University of Freiburg. The laboratory contains tables, chairs,
monitors and ventilators. For each object class, an iAMN is trained with 3D range
scans each containing just one object of this class (apart from tables, which may have
screens standing on top of them). Figure 2 shows three example training objects. A
complete laboratory in building 79 of the University of Freiburg was later scanned
with a 3D laser. In this 3D scene all the objects appear together and the scene is used
as a test set. The resulting classification is shown in Figure 3. In this experiment
76.0% of the 3D points were classified correctly.

6.3 Comparison with previous approaches 

In this section we compare our results with those obtained using other approaches
for place and object classification. First, we compare the classification of the 2D map
with that of a classifier based on AdaBoost, as presented by Martinez Mozos et al. (2005).
In this case we obtained a classification rate of 92.1%, in contrast to the 97.6%
obtained using iAMNs. We believe that the reason for this improvement is the
neighboring relation between classes, which is ignored by the AdaBoost approach. In
a second experiment, we compare the resulting classification of the 3D scene with the
one obtained when using AMN and NN. As we can see in Table 1, iAMNs perform
better than the other approaches. A subsequent statistical analysis using Student's
t-test indicates that the improvement is significant at the 0.05 level. We additionally
performed experiments in which we used the 3D scans of isolated objects for
training and test purposes. The results are shown in Table 1 and confirm that
iAMNs outperform the other methods.



7 Conclusions 

In this paper we propose a semantic classification algorithm that combines
associative Markov networks with an instance-based approach based on nearest neighbor.
Furthermore, we show how this method can be used to classify points described by
features extracted from 2D and 3D laser scans. Additionally, we present an approach
to reduce the number of features needed to represent each data point, while
maintaining their class discriminatory information. Experiments carried out in 2D and 3D
maps demonstrated the effectiveness of our approach for semantic classification of
places and objects in indoor environments.

Fig. 2. 3D scans of isolated objects used for training: a ventilator, a chair and a table with a
monitor on top.

Fig. 3. Classification of a complete 3D range scan obtained in a laboratory at the University
of Freiburg.

Table 1. Classification results in 3D data

Data set            NN     AMN    iAMN
Complete scene      63%    62%    76%
Isolated objects    81%    72%    89%

Abstract. Theory often suggests spatial correlations without being explicit about the exact
form. Hence, econometric tests are used for model choice. So far, mainly Lagrange Multiplier
tests based on ordinary least squares residuals have been employed to decide whether and in which
form spatial correlation is present in Cliff-Ord type spatial models. In this paper, the model
selection is based both on likelihood ratio and Wald tests using estimates for the general model
and on information criteria. The results of a large Monte Carlo study suggest that
Wald tests on the spatial parameters after estimation of the general model are the most reliable
approach to reveal the nature of spatial correlation.



1 Introduction 

Theoretical considerations frequently suggest proximity and/or similarity between
observational units as an important determinant. Econometric models that try to capture
this proximity and/or similarity are referred to as ’spatial models’. Spatial models are
nowadays widely employed. Spatial correlation can have numerous causes, e.g.
interaction between cross-sectional units could be due to environmental circumstances,
network externalities, market interdependencies, strategic effects such as tax setting
behavior and vote seeking behavior, contagion problems, population and
employment growth, or the determinants of welfare expenditures. For a state-of-the-art
overview, see the book by Anselin, Florax and Rey (2004). A Google Scholar search
with the words ’spatial correlation cliff ord’ led to 1,770 hits. Models of this kind
capture the proximity between observational units either by introducing a
spatially lagged (endogenous or exogenous) variable or by modeling spatial
correlation in the error term. Either way, it is necessary to specify a weighting scheme
that defines the proximity or similarity. A common example of the former is
the inverse distance between the capitals, whereas examples of the latter are
membership in regional trade groups or a common language.

In most cases theory is silent about the explicit functional form of the spatial
interaction. In many applications, modeling a spatial lag in the endogenous
variable and/or a lag in the error term is consistent with the theory. Including both a
spatially lagged endogenous variable and spatial correlation in the error term may
therefore be useful in order to obtain white noise errors and valid hypothesis tests
for the regression parameters. The spatial autoregressive model with spatial
autoregressive disturbances is then an obvious model to start with. However, this general
model has so far not been considered as the starting point for model selection/specification.

For the choice of the econometric model, there are basically two different
approaches that can be employed: the ’bottom-up’ or the ’top-down’ approach. In the
spatial econometric literature the classical specification search approach has been
predominant, which is the ’specific to general’ or ’bottom-up’ approach. First a
model without spatially lagged variables is estimated. Afterwards, Lagrange
Multiplier (LM) tests for the spatial error model or the spatial lag model using ordinary
least squares (OLS) residuals are employed to decide whether spatial correlation is
present or not. If the null hypothesis of a test for a spatial autoregressive process
is rejected, a spatial variant is estimated (see Florax et al. (2003)). Florax and de
Graaff (2004) suggest relying on the ad hoc decision rule that whichever test statistic
is greater and significantly different from zero points to the right alternative. Note,
however, that LM tests for the spatial error and the spatial lag model exhibit power
against both alternatives.

The second approach is a ’general to specific’ or ’top-down’ approach put for- 
ward by Hendry (1979), and in spatial econometrics by Florax et al. (2003). The 
’top-down’ approach starts with a very general model that allows for spatial cor- 
relation among various variables. A sequence of specification tests progressively 
simplifies the model. We propose to use the ’top-down’ approach with the spatial 
autoregressive model with spatial autoregressive disturbances as the general model. 
The appropriateness of this approach is shown in a large Monte Carlo study, using 
maximum likelihood (ML) and generalized method of moments (GMM) estimators. 



2 Model and test statistics 

We describe the estimation approaches for the spatial autoregressive model with
spatial autoregressive disturbances (henceforth SARAR(1,1)), i.e. in our case the
most general model. The estimation procedures for the other models are then
easily obtained by imposing the restriction $\rho = 0$ for the spatial error model
(abbreviated by SARAR(0,1)) and $\lambda = 0$ for the spatial lag model (denoted subsequently
by SARAR(1,0)). We restrict ourselves to these classes of model choice and do not
consider other possible functional forms or misspecifications (see Dubin (2003) for an
analysis of misspecification resulting from an improper weighting matrix and
McMillen (2003) for misspecification concerning the functional form).

The data generating process (DGP) for the SARAR(1,1) model considered in our
study is given by:

$$y = \rho W y + X\beta + u, \qquad u = \lambda W u + \varepsilon, \qquad (1)$$

where $y$ is the $n \times 1$ vector of the dependent variable, $n$ is the sample size, $X$ is the
$n \times k$ matrix of the independent variables, $k$ is the number of independent variables,
$\beta$ is the $k \times 1$ vector of coefficients, $W$ is a given $n \times n$ weighting matrix, $\rho$ is the
coefficient of the spatially lagged dependent variable, $\lambda$ is the spatial error correlation
coefficient, and $\varepsilon$ is the $n \times 1$ disturbance term. The disturbances $\varepsilon_i$ ($i = 1, \dots, n$)
are assumed to be i.i.d. $(0, \sigma^2)$ with finite second and fourth moments. Further, we
assume that all diagonal elements of the row-normalized weighting matrix $W$ are
zero and that the absolute values of $\rho$ and $\lambda$ are less than 1, so that the matrices
$(I - \rho W)$ and $(I - \lambda W)$ are nonsingular.
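Because both matrices are nonsingular, the DGP in (1) has the reduced form $y = (I - \rho W)^{-1}[X\beta + (I - \lambda W)^{-1}\varepsilon]$, which makes it easy to draw artificial data. The following Python sketch illustrates this; the function name and argument layout are assumptions made for illustration, and standard normal disturbances are used as one admissible i.i.d. choice.

```python
import numpy as np

def simulate_sarar(W, X, beta, rho, lam, rng):
    """Draw one sample (y, u, eps) from y = rho*W*y + X*beta + u, u = lam*W*u + eps."""
    n = W.shape[0]
    identity = np.eye(n)
    eps = rng.standard_normal(n)                       # i.i.d. disturbances
    u = np.linalg.solve(identity - lam * W, eps)       # u = (I - lam W)^{-1} eps
    y = np.linalg.solve(identity - rho * W, X @ beta + u)
    return y, u, eps
```

Setting `rho=0` yields the spatial error model SARAR(0,1), and `lam=0` yields the spatial lag model SARAR(1,0).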

2.1 Estimation approaches 

We use two different approaches to estimate our models: (i) Maximum Likelihood,
and (ii) GMM. For the maximum likelihood estimator two of the first order
conditions are employed to get the concentrated log-likelihood function
$LL_c = \mathrm{Fkt}(\rho, \lambda \mid X, y)$. This is a non-linear function in the two parameters $\rho$ and $\lambda$
(Anselin (1988a)). The standard errors of all the estimators are obtained via the
information matrix.

The second approach is based on the generalized method of moments (GMM). The
GMM estimator is a two-stage least squares procedure that uses additional moment
conditions to estimate the spatial parameters. To account for the endogeneity of $Wy$,
all independent variables as well as the once and twice spatially lagged independent
variables ($[X, WX, W^2X]$) serve as instruments, as recommended by Kelejian and
Prucha (1999). Kelejian and Prucha (1999) proposed a three-step procedure. In the
first step a consistent estimate for the residuals is obtained by two-stage least squares
(2SLS). These residuals are used in the moment conditions to estimate the spatial
correlation coefficient of the error term. In the final step, a Cochrane-Orcutt type
transformation is applied and the parameters are estimated by 2SLS on the transformed
values. Lee (2003) proved that these instruments do not lead to asymptotically
efficient parameter estimates. He suggests using $H = (I - \lambda W)[X, W(I - \rho W)^{-1}X\hat\beta]$ as
instruments, where $\hat\beta$ are the estimates from the first-step regression. We apply these
optimal instruments by replacing $\rho$ and $\lambda$ with their estimates from the first step.
The standard errors for the regression coefficients and the coefficient of the spatially
lagged dependent variable are readily obtained from the last stage regression.
However, in order to obtain the standard error for the spatial error parameter, we have to
apply the estimator suggested in Kelejian and Prucha (2006).

2.2 Applied tests for model selection 

First the ’specific to general’ approaches are described. These tests start from the
simplest model and turn to more complicated ones if the test statistic rejects the
simple model. In the applied framework the simplest model is one without spatial
lag and spatial error, i.e. OLS regression.

Available tests are mainly LM tests, which rely only on the estimates of the
model under the null hypothesis. Basically, the LM tests suggested by Anselin et
al. (1996) are implemented. As the LM tests are based on OLS residuals, $\hat u$ denotes
the estimated residuals from the OLS regression, and $\hat\sigma^2 = (1/n)\hat u'\hat u$. Further, we
have to distinguish whether we assume the second spatial parameter to be zero or
not. The following definitions will simplify the expressions: $T = \mathrm{tr}((W' + W)W)$,
$M = I - X(X'X)^{-1}X'$, $J_{\rho\beta} = \frac{1}{n\hat\sigma^2}\left[(WX\hat\beta)'M(WX\hat\beta) + T\hat\sigma^2\right]$. Now the following tests
can be conducted:



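Model: $y = X\beta + u$, assumption: $\rho = 0$, $H_0: \lambda = 0$

$$LM_\lambda = \frac{(\hat u' W \hat u / \hat\sigma^2)^2}{T} \qquad (2)$$

Model: $y = X\beta + u$, assumption: $\lambda = 0$, $H_0: \rho = 0$

$$LM_\rho = \frac{(\hat u' W y / \hat\sigma^2)^2}{n J_{\rho\beta}} \qquad (3)$$

Model: $y = \rho W y + X\beta + u$, $H_0: \lambda = 0$

$$LM_\lambda^* = \frac{\left[\hat u' W \hat u / \hat\sigma^2 - T (n J_{\rho\beta})^{-1}\, \hat u' W y / \hat\sigma^2\right]^2}{T\left[1 - T (n J_{\rho\beta})^{-1}\right]} \qquad (4)$$

Model: $y = X\beta + u$, $u = \lambda W u + \varepsilon$, $H_0: \rho = 0$

$$LM_\rho^* = \frac{\left[\hat u' W y / \hat\sigma^2 - \hat u' W \hat u / \hat\sigma^2\right]^2}{n J_{\rho\beta} - T} \qquad (5)$$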
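As an illustration, the four OLS-based statistics (2) to (5) can be computed directly from the definitions of $T$, $M$ and $J_{\rho\beta}$ given above. The Python sketch below is a minimal example under the assumption that $X$ has full column rank and that $WX\hat\beta$ does not lie in the column space of $X$; the function name and the dictionary keys are chosen for illustration only.

```python
import numpy as np

def lm_tests(y, X, W):
    """LM statistics (2)-(5) computed from OLS residuals."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)        # OLS coefficients
    u = y - X @ beta                                     # OLS residuals
    sigma2 = u @ u / n
    T = np.trace((W.T + W) @ W)
    M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)    # residual maker
    WXb = W @ X @ beta
    nJ = (WXb @ M @ WXb + T * sigma2) / sigma2           # n * J_{rho beta}
    d_lam = u @ W @ u / sigma2
    d_rho = u @ W @ y / sigma2
    return {
        "LM_lambda": d_lam ** 2 / T,
        "LM_rho": d_rho ** 2 / nJ,
        "LM_lambda_star": (d_lam - T / nJ * d_rho) ** 2 / (T * (1.0 - T / nJ)),
        "LM_rho_star": (d_rho - d_lam) ** 2 / (nJ - T),
    }
```

Each statistic would typically be compared with a chi-square critical value with one degree of freedom (3.84 at the 5% level).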



LM tests for $\rho$ and $\lambda$ in the case of spatial correlation in the error term or in the
dependent variable, respectively, which are assumed to be estimated, were derived by
Anselin (1988b):

$$LM_\lambda^A = \frac{(\hat u' W_2 \hat u / \hat\sigma^2)^2}{T_{22} - (T_{21A})^2\, \widehat{\mathrm{var}}(\hat\rho)} \qquad (6)$$

$$LM_\rho^A = \frac{(\hat u' B' B W_1 y)^2}{H_\rho - H_{\theta\rho}\, \widehat{\mathrm{var}}(\hat\theta)\, H_{\theta\rho}'} \qquad (7)$$

where $T_{21A} = \mathrm{tr}[W_2 W_1 A^{-1} + W_2' W_1 A^{-1}]$, $A = I - \rho W_1$, $\theta' = (\beta', \lambda, \sigma^2)'$, $B = I - \lambda W_2$,
$H_\rho = \mathrm{tr}\,W^2 + \mathrm{tr}(BWB^{-1})'(BWB^{-1}) + \frac{1}{\sigma^2}(BWX\beta)'(BWX\beta)$ and

$$H_{\theta\rho} = \begin{pmatrix} \frac{1}{\sigma^2}(BX)'BWX\beta \\ \mathrm{tr}(WB^{-1})'BWB^{-1} + \mathrm{tr}\,WWB^{-1} \\ 0 \end{pmatrix},$$

with $\widehat{\mathrm{var}}(\hat\theta)$ as the estimated variance matrix for the parameter vector $\theta$ in the null model.

Besides the LM tests described above, we calculate likelihood ratio (LR) tests. For this
purpose both the restricted and the unrestricted model have to be estimated, i.e.
$LR = -2(LL_r - LL_{ur})$, where $LL_{ur}$ ($LL_r$) denotes the value of the maximized
log-likelihood of the unrestricted (restricted) model.

Third, we calculate Wald tests for both the MLE and the GMM approach. The
SARAR(1,1) model has to be estimated in order to test against sparser variants.
Hence, these tests are in the vein of a ’general to specific’ methodology. Given the
estimates for the general model, we can conduct the Wald test for $\rho$ and $\lambda$:
$W_\rho = \hat\rho / \hat\sigma_{\hat\rho}$ and $W_\lambda = \hat\lambda / \hat\sigma_{\hat\lambda}$, where $\hat\rho$ and $\hat\lambda$ are the estimates of the general
model under consideration, and $\hat\sigma_{\hat\rho}$ and $\hat\sigma_{\hat\lambda}$ are the estimated standard errors thereof.
Note that with the estimates of the SARAR(1,1) model we can conduct a test for joint
significance of $\rho$ and $\lambda$ for both the MLE and the GMM estimators.
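To make the LR and Wald statistics concrete, the following Python sketch computes them together with p-values. It assumes the usual asymptotic reference distributions (standard normal for the ratio of estimate to standard error, chi-square with one degree of freedom for an LR test of a single restriction); the text does not state these explicitly, and the function names are illustrative.

```python
import math

def wald_statistic(estimate, std_error):
    """Wald statistic as the ratio of an estimate to its standard error."""
    return estimate / std_error

def p_value_normal(z):
    """Two-sided p-value of a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def lr_statistic(ll_restricted, ll_unrestricted):
    """Likelihood ratio statistic LR = -2 (LL_r - LL_ur)."""
    return -2.0 * (ll_restricted - ll_unrestricted)

def p_value_chi2_1df(stat):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))
```

For example, an estimate of 0.5 with a standard error of 0.25 gives a Wald statistic of 2.0, which is significant at the 5% level under the normal approximation.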

Fourth, widely used information criteria are implemented in order to identify the
true DGP. The Akaike information criterion (AIC), the bias corrected Akaike
criterion (AICc), and the Schwarz information criterion (BIC) are calculated (e.g., Belitz
and Lang (2006)).
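For reference, the three criteria can be computed from the maximized log-likelihood as sketched below in Python. The formulas are the standard textbook definitions, which are assumed here because the text does not spell them out; `k` denotes the number of estimated parameters and `n` the sample size.

```python
import math

def information_criteria(loglik, k, n):
    """Return AIC, bias-corrected AIC and BIC; smaller values are preferred."""
    aic = -2.0 * loglik + 2.0 * k
    aicc = aic + 2.0 * k * (k + 1) / (n - k - 1)   # small-sample correction
    bic = -2.0 * loglik + k * math.log(n)
    return {"AIC": aic, "AICc": aicc, "BIC": bic}
```

The model with the smallest value of the chosen criterion would be selected; with n = 400 the penalty per parameter is ln(400), roughly 6, for BIC against 2 for AIC, so BIC favours the more parsimonious models.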



3 Monte Carlo study 

All test evaluations are done using a sample size of 400. The regression coefficient
vector $\beta$ is set to $(1, 1)'$. The independent variable is drawn randomly from the
uniform distribution between zero and twenty. The remainder noise is normally
distributed with mean zero and variance one. For each setting of the true DGP 1000
Monte Carlo data sets are generated, which leads to a 95% confidence interval for
the nominal significance level of 5% ± 1.35%.
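The interval width follows from the normal approximation to the binomial distribution of rejection frequencies. A short Python check, where the function name and the critical value 1.96 are assumptions for illustration:

```python
import math

def mc_half_width(alpha, replications, z=1.96):
    """Half-width of the confidence interval for an empirical rejection rate."""
    return z * math.sqrt(alpha * (1.0 - alpha) / replications)

half_width = mc_half_width(0.05, 1000)   # about 0.0135, i.e. 1.35 percentage points
```

An empirical size between 3.65% and 6.35% is therefore not significantly different from the nominal 5% level.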

Two different weighting schemes are employed. The units are ordered regularly
in a square grid of size $\sqrt{n} \times \sqrt{n}$. The first weighting matrix uses the Moore (Queen,
e.g., Anselin (1988b)) neighborhood with radius $r = 1$. After row normalizing the
matrix, the weighting matrix $W$ is obtained, denoted henceforth as $W_1$. For the
second weighting matrix ($W_2$) the distance $d_{ij}$ between observation units $i$ and $j$ is
computed and the elements of the weighting matrix are calculated as $1/d_{ij}$ if $i \neq j$. The
diagonal elements are set to zero. In order to limit the neighboring influence, the
elements of the weighting matrix are additionally set to zero if the distance is greater
than 7.1 (which corresponds to a radius of 5). Hence, the weighting matrix based on
the Moore neighborhood ($W_1$) is sparser and exhibits less spatial connectivity
than the one based on the distance ($W_2$).
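The two weighting schemes can be sketched in Python as follows. The function names are illustrative, unit spacing between grid cells is assumed, and $W_2$ is row-normalized here in line with the model assumptions of Section 2, which the description of $W_2$ itself does not state explicitly.

```python
import numpy as np

def grid_coordinates(side):
    """Row and column coordinates of the side*side units on a regular grid."""
    rows, cols = np.divmod(np.arange(side * side), side)
    return np.column_stack([rows, cols]).astype(float)

def row_normalize(A):
    """Scale every row with a positive sum so that it sums to one."""
    sums = A.sum(axis=1, keepdims=True)
    return np.divide(A, sums, out=np.zeros_like(A), where=sums > 0)

def moore_weights(side):
    """W_1: Moore (queen) neighbourhood with radius 1, row-normalized."""
    xy = grid_coordinates(side)
    chebyshev = np.abs(xy[:, None, :] - xy[None, :, :]).max(axis=2)
    return row_normalize((chebyshev == 1).astype(float))

def inverse_distance_weights(side, cutoff=7.1):
    """W_2: inverse Euclidean distance, zero beyond the cutoff and on the diagonal."""
    xy = grid_coordinates(side)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    A = np.zeros_like(d)
    mask = (d > 0) & (d <= cutoff)
    A[mask] = 1.0 / d[mask]
    return row_normalize(A)
```

For n = 400, `moore_weights(20)` gives at most eight neighbours per unit, whereas `inverse_distance_weights(20)` connects every unit to all units within the cutoff distance.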

In order to obtain the power function, the true spatial correlation parameters
are varied in the following way: $(\rho, \lambda)$ = (0, 0.5), (0.05, 0.5), (0.1, 0.5), (0.15, 0.5),
(0.2, 0.5), (0.5, 0), (0.5, 0.05), (0.5, 0.1), (0.5, 0.15), (0.5, 0.2).

4 Results 

Let us first analyze the experiments with SARAR(1,1) as true DGP. In order to obtain
the size and the power of the Wald test the spatial parameter $\lambda$ ($\rho$) is fixed at the value
of 0.5. The second spatial parameter $\rho$ ($\lambda$) is varied from 0 to 0.2. The actual size
of the Wald test with the null hypothesis $H_0: \rho = 0$ ($H_0: \lambda = 0$) is not significantly
different from the nominal size of 5%. The joint hypothesis test supports the
alternative hypothesis in 100% of the cases, as does the Wald test for the corresponding second
spatial parameter $\lambda$ ($\rho$). Hence, the correct more parsimonious model under the null
hypothesis is detected accordingly.

When testing the null hypothesis that the spatial correlation parameter $\rho$
is zero in the presence of spatial correlation in the error term, the Wald test has
very good power, although the power is higher with the sparser weighting matrix.
The latter characteristic is a general feature of all conducted tests. The power for
the spatial error parameter in the presence of a non-zero spatial lag parameter is
lower. However, the power of the Wald test in these circumstances is (much) greater
than the power achievable by using Lagrange Multiplier tests. In Figure 1c the best
performing LM test is plotted, i.e. LM^. All LM tests relying on OLS residuals fail
seriously to detect the true DGP.

Comparable to the performance of the Wald test based on MLE estimates is the 
Wald test based on GMM estimates but only in detecting the significant lag parameter 
in the presence of a significant spatial error parameter. In the reverse case the Wald 
test using GMM estimates is much worse. 

As a further model selection approach the performance of information criteria
is analyzed. The performance of the classical Akaike information criterion and the
bias corrected AIC are almost identical. In Figure 1d the share of cases in which
AIC/AICc identifies the correct DGP is plotted on the y-axis. All information
criteria fail in more than 15% of the cases to identify the correct more parsimonious
model, i.e. SARAR(1,0) or SARAR(0,1) instead of SARAR(1,1). However, in the
remaining experiments ($\rho$ = 0.05, ..., 0.2 or $\lambda$ = 0.05, ..., 0.2) the performance of AIC/AICc
is comparable to that of the Wald test. BIC performs better than AIC/AICc at detecting
SARAR(1,0) or SARAR(0,1) instead of SARAR(1,1) but much worse in the
remaining experiments.

In order to be able to propose a general procedure for model selection, the
approach must also be suitable if the true DGP is SARAR(1,0) or SARAR(0,1). In this
case the Wald test based on the general model again has the appropriate size and
very good power. Further, the sensitivity to different weighting matrices is less
severe. However, the power is smallest for the test with the null hypothesis $H_0: \lambda = 0$
and with distance as weighting scheme ($W_2$). The Wald test using GMM estimates is
again comparable when testing for the spatial lag parameter but worse when testing
for the spatial error parameter.

The power functions of both LM statistics based on OLS residuals are not significantly
different from that of the Wald test based on the general model. However, in this case
LM^ fails to identify the correct DGP.

The Wald test outperforms the information criteria regarding the identification of
SARAR(1,0) or SARAR(0,1). If OLS is the DGP, the correct model is chosen only
about two thirds of the time by AIC/AICc, but about as often as by the Wald test by BIC. If
SARAR(1,0) is the data generating process, all information criteria perform worse
than the Wald test, independent of the underlying weighting scheme. If
SARAR(0,1) is the data generating process, BIC is worse than the Wald test, and
AIC/AICc has a slightly higher performance for small values of the spatial
parameter but is outperformed by the Wald test for higher values of the spatial parameter.
For the sake of completeness it is noted that no valid model selection can be
conducted using likelihood ratio tests.

Fig. 1. a) Power of the Wald test based on the general model and MLE estimates, b) Power
of the Wald test based on the general model and GMM estimates (optimal instruments),
c) Power of the Lagrange Multiplier test using LM^ as test statistic, d) Correct model choice
of the better performing information criterion (AIC/AICc).



To conclude, we find that the ’general to specific’ approach is the most suitable
procedure to identify the correct data generating process (DGP) for Cliff-Ord
type spatial models. Regardless of whether the true DGP is a SARAR(1,1),
SARAR(1,0), SARAR(0,1), or just a regression model without any spatial
correlation, the general model should be estimated and the Wald tests conducted. The
chance to identify the true DGP is then higher compared to the alternative model
choice criteria based on the LM tests, LR tests or on information criteria like AIC,
AICc or BIC.

Abstract. The term Urban Data Mining is defined to describe a methodological approach
that discovers logical or mathematical, and partly complex, descriptions of urban patterns and
regularities inside the data. The concept of data mining in connection with knowledge
discovery techniques plays an important role in the empirical examination of high dimensional data
in the field of urban research. Procedures based on knowledge discovery systems
have so far not been scrutinised closely with regard to a meaningful integration into the regional
and urban planning and development process. In this study ESOM is used to examine communities in
Germany. The data describe dynamic processes (e.g. shrinking and growing
of cities). In the future it might be possible to establish an instrument that defines objective
criteria for benchmarking urban phenomena. The use of GIS supplements the
process of knowledge conversion and communication.



1 Introduction 

Comparisons of cities and typological grouping processes are methodical
instruments to develop statistical scales and criteria about urban phenomena. This line of
work started in 1943 with Harris, who ranked US cities according to industrial
specialization data; many of the studies that followed added occupational data to the
classification models. Later on, in the 1970s, classification studies were geared to
measuring social outcomes and shifted more towards the goals of public policy. Forst (1974)
presents an investigation of German cities using social and economic variables. In Great
Britain, Craig (1985) employed a cluster analysis technique to classify 459 local
authority districts, based on the 1981 Census of Population. Hill et al. (1998)
classified US cities using the cities' population characteristics. Most of the mentioned
classification studies use economic, social, and demographic variables as a basis
for their classifications, which are usually calculated by clustering algorithms (e.g.
Ward, K-Means). Geospatial objects are analysed by Demsar (2006). These former
approaches to city classification are summarized in Behnisch (2007).

The purpose of this article is to find groups (clusters) of communities with the
same dynamic characteristics in Germany (e.g. shrinking and growing of cities).
The application of Emergent Self Organizing Maps (ESOM) and the corresponding
U*C-Algorithm is proposed for the task of city classification. The term Urban
Data Mining (Behnisch, 2007) is defined to describe a methodological approach that
discovers logical or mathematical, and partly complex, descriptions of urban patterns
and regularities inside the data. The result can suggest a general typology and can
lead to the development of prediction models using subgroups instead of the total
population.



2 Inspection and transformation of data 

Four variables were selected for the classification analysis. The variables characterise
a city’s dynamic behaviour. The data was created by the German BBR (Federal
Office for Building and Regional Planning) and refers to the statistics of inhabitants
($V_1$), migration ($V_2$), employment ($V_3$) and mobility ($V_4$). The dynamic processes are
characterised by positive or negative percentage changes between the years 1999
and 2003. The inspection of data includes the visualisation in the form of histograms,
QQ-Plots, PDE-Plots (Ultsch, 2003) and Box-Plots. The authors decided to use
transformations such as the ladder of powers to take into account restrictions of
statistics (Hand et al., 2001 or Ripley, 1996). Figure 1 and Figure 2 show an example
of the distribution of variables. As a result of pre-processing the authors find a
mixture of two distributions with decision boundary zero in each of the four variables.
All variables are transformed by using Slog(x) = sign(x) · log(|x| + 1).
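The signed logarithm is straightforward to implement. The Python sketch below assumes the natural logarithm, since the base is not stated in the text.

```python
import numpy as np

def slog(x):
    """Signed log transform: sign(x) * log(|x| + 1), applied element-wise."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log(np.abs(x) + 1.0)
```

The transform is symmetric around zero, maps zero to zero and compresses large positive and negative percentage changes alike, which keeps the decision boundary at zero intact.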




Fig. 1. QQ-Plot (inhabitants)

Fig. 2. PDE-Plot (Slog inhabitants)



The first hypothesis about the distribution of each variable is a bimodal distribution
of lognormally distributed data (Data > 0: skewed right; Data < 0: skewed left).

The result of the detailed examination is summarized in Table 1. The data follows
a lognormal distribution. Decision boundaries will be used to form a basis for a
manual classification process and support the interpretation of results.

With regard to the classification approach (e.g. U*-Matrix and subsequent
U*C-Algorithm), which relies on the Euclidean distance, the data need to be standardized.
Figure 3 shows Scatter-Plots of the transformed variables.

Table 1. Examination of the four distributions

Variable      Slog(Data)                Decision Boundaries    Size of Classes
inhabitants   bimodal distribution      C1: Data < 0           [5820], 46.82%
                                        C2: Data > 0           [6610], 53.18%
migration     bimodal distribution      C1: Data < 0           [4974], 40.02%
                                        C2: Data > 0           [7456], 59.98%
employment    bimodal distribution      C1: Data < 0           [7492], 60.27%
                                        C2: Data > 0           [4938], 39.73%
mobility      multimodal distribution   C1: Data < 0           [2551], 20.52%
                                        C2: 0 < Data < 50      [9317], 74.96%
                                        C3: Data > 50          [562], 4.52%
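The standardization required before computing Euclidean distances is not specified further in the text. A common choice, assumed here purely for illustration, is the z-score per variable:

```python
import numpy as np

def standardize(X):
    """Scale every column of X to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

After this step each of the four variables contributes on a comparable scale to the distances between municipalities.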




Fig. 3. Scatter-Plots of transformed variables 



3 Method 

In the field of urban planning and regional science data are usually multidimensional,
spatially correlated and especially heterogeneous. These properties often make classical
data mining algorithms inappropriate for this data, as their basic assumptions
cease to be valid. The power of self-organization allows the emergence of structure
in data and supports visualization, clustering and labelling based on a combined
distance and density based approach. To visualize high-dimensional data, a
projection from the high dimensional space onto two dimensions is needed. This projection
onto a grid of neurons is called a SOM map. There are two different SOM usages. The
first are SOM as introduced by Kohonen (1982). Neurons are identified with clusters
in the data space (k-means SOM) and there are very few neurons. The second are
SOM where the map space is regarded as a tool for the visualization of the otherwise
high dimensional data space. These SOM consist of thousands or tens of thousands of
neurons. Such SOM allow the emergence of intrinsic structural features of the data
space and are therefore called Emergent SOM (Ultsch, 1999). The map of an
ESOM preserves the neighbourhood relationships of the high dimensional data, and
the weight vectors of the neurons are thought of as sampling points of the data. The
U-Matrix has become the canonical tool for the display of the distance structures of
the input data on ESOM. The P-Matrix takes density information into account. The
combination of U-Matrix and P-Matrix leads to the U*-Matrix. On this U*-Matrix a
cluster structure in the data set can be detected directly. Compare the examples in
Figure 4, which use the same data, to see whether cluster structures become
visible.
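To illustrate how a U-Matrix displays distance structures, the Python sketch below computes a U-height for every neuron of a trained map. It is a simplified sketch under stated assumptions: the U-height is taken as the mean distance between a neuron's weight vector and those of its four direct grid neighbours on a planar (non-toroidal) grid, whereas ESOM tools may use other neighbourhoods or a sum instead of a mean.

```python
import numpy as np

def u_matrix(weights):
    """U-heights for a weight array of shape (rows, cols, dim)."""
    rows, cols, _ = weights.shape
    total = np.zeros((rows, cols))
    count = np.zeros((rows, cols))
    # distances between vertically adjacent neurons
    dv = np.linalg.norm(weights[1:] - weights[:-1], axis=2)
    total[1:] += dv
    count[1:] += 1
    total[:-1] += dv
    count[:-1] += 1
    # distances between horizontally adjacent neurons
    dh = np.linalg.norm(weights[:, 1:] - weights[:, :-1], axis=2)
    total[:, 1:] += dh
    count[:, 1:] += 1
    total[:, :-1] += dh
    count[:, :-1] += 1
    return total / count
```

High values mark boundaries between regions of the map whose weight vectors differ strongly, while low values indicate neurons that lie inside a homogeneous region.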




Fig. 4. K-Means-SOM by Kaski et al. (1999), left and U*-Matrix, right 



The often used finite grid as map has the disadvantage that neurons at the rim of
the map have very different mapping qualities compared to neurons in the centre.
This is important during the learning phase and structures the projection.
In many applications important clusters appear in the corner of such a planar map.
Using ESOM methods for clustering has the advantage of a nonlinear
disentanglement of complex structures.

The clustering of the ESOM can be performed at two different levels. First, the
Bestmatch Visualization can be used to mark data points that represent a neuron with a
defined characteristic. Bestmatches and thus corresponding data points can be
manually grouped into several clusters. Not all points need to be labelled; outliers are
usually easily detected and can be removed. Secondly, the neurons can be clustered
by using a clustering algorithm called U*C, which is based on grid projections and
uses distance and density information (Ultsch (2005)). In most cases an aggregation
process of objects is necessary to build up a meaningful classification. Assigning a
name to a cluster is one of the most important steps in defining the
meaning of a cluster. The interpretation is based on the attribute values. Moreover, it is
possible to integrate techniques of Knowledge Discovery to understand the structure
in a complementary form and support the finding of an appropriate cluster
denomination. Examples are symbolic algorithms such as SIG* or U-Know (Ultsch (2007)),
which lead to significant properties for each cluster and a fundamental knowledge
based description.



4 Results 

A first classification is based on the dichotomic characteristics of the four variables.
$2^4$ classes are detected by using the decision boundaries (Variable > 0 or Variable
< 0). Further aggregation leads to the five classes of Table 2. In terms of content, the
classes relate to the established pressure factors for urban dynamic development
(population and employment). The purpose of such a content-based classification was to
sharpen the characteristics and to find a specific label.



Table 2. Classes of Urban Dynamic Phenomena

Label                                      Inhabitants   Migration   Employment
Shrinking of Inhabitants and Employment    low           low         low
Shrinking but influx                       low           high        low
Growing of Employment                      low           -           high
Growing of Inhabitants                     high          -           low
Growing of Inhabitants and Employment      high          -           high
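The aggregation in Table 2 can be expressed as a small rule set. The Python sketch below rests on two assumptions that are not stated in the text: "high" is read as a value greater than zero and "low" as a value of zero or below, and an empty migration cell is read as "any value".

```python
def label_municipality(inhabitants, migration, employment):
    """Assign one of the five labels of Table 2 from the signs of three variables."""
    inhabitants_high = inhabitants > 0
    employment_high = employment > 0
    if inhabitants_high and employment_high:
        return "Growing of Inhabitants and Employment"
    if inhabitants_high:
        return "Growing of Inhabitants"
    if employment_high:
        return "Growing of Employment"
    if migration > 0:
        return "Shrinking but influx"
    return "Shrinking of Inhabitants and Employment"
```

For instance, a municipality with falling population and employment but positive net migration is labelled "Shrinking but influx".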



An ESOM with 50x82 neurons is trained with the pre-processed data to verify
the defined structure. The corresponding U*-Map renders the input data as a
geographical landscape on a projected map (imaginary axes). The cluster boundaries are
expressed by mountains; that is, the height, displayed on the z-axis, represents the
distance between different objects. A valley describes similar objects,
characterized by small U-heights on the U*-Matrix. Data points found in coherent
regions are assigned to one cluster. All local regions lying in the same cluster have
the same spatial properties.

The U*-Map (Island View) can be seen in Figure 5, in connection with the
U*-Matrix of Figure 6 including the clustering results of the U*C-Algorithm with 11 classes.
The existing clusters are described by the U-Know Algorithm, and the symbolic
description is comparable to the dichotomic properties. The interpretation of the
clustering results finally leads to the same five main classes obtained by the content-based
aggregation. It is remarkable that the structure of the first classification can be
recognized later by using Emergent SOM.

Figure 7 shows the solution with five main clusters and displays the spatial
structure of the classified objects. It is evident that growing processes can be found
in the southern and western parts of Germany, while shrinking processes are
localized in the eastern part. Shrinking processes also exist in areas of traditional coal and
steel industry.

Fig. 5. U*-Map (Island View) 




Fig. 6. U*-Matrix and Result of U*C-Algorithm



Fig. 7. Localisation of Shrinking and Growing Municipalities in Germany (legend: Shrinking
of Inhabitants and Employment; Shrinking but influx; Growing of Employment; Growing of
Inhabitants; Growing of Inhabitants and Employment; Not Classified)



5 Conclusion

The authors present a classification approach in connection with geospatial data. The
central issue of the grouping processes is the shrinking and growing phenomena in
Germany. First the authors examine the pool of data and show the importance of
investigating the distributions according to the dichotomic properties. Afterwards it is
shown that the use of Emergent SOMs is an appropriate method for clustering and
classification. The advantage is that the structure of the data can be visualized and a
number of feasible clusters can be defined later on, using the U*C-algorithm or manual
bestmatch grouping processes. The application of existing visual methods, especially the
U*-Matrix, shows that it is possible to detect meaningful classes among a large number of
geospatial objects. For example, a typical hierarchical algorithm would fail to examine
12430 objects. As such, the authors believe that the presented procedure of the wise
classification and the ESOM approach complements the former proposals for city
classification. It is expected that in the future the concept of data mining in connection
with knowledge discovery techniques will become increasingly important for urban research
and planning processes (Streich, 2005). Such approaches might lead to a benchmark system
for regional policy or other strategic institutions. To get more data for a deeper
empirical examination it is necessary to conduct field investigations in selected areas.
Abstract. The Konstanz Information Miner is a modular environment, which enables easy
visual assembly and interactive execution of a data pipeline. It is designed as a teaching, 
research and collaboration platform, which enables simple integration of new algorithms and 
tools as well as data manipulation or visualization methods in the form of new modules or 
nodes. In this paper we describe some of the design aspects of the underlying architecture and 
briefly sketch how new nodes can be incorporated. 



1 Overview 

The need for modular data analysis environments has increased dramatically over the 
past years. In order to make use of the vast variety of data analysis methods around, it 
is essential that such an environment is easy and intuitive to use, allows for quick and 
interactive changes to the analysis process and enables the user to visually explore 
the results. To meet these challenges data pipelining environments have gathered 
incredible momentum over the past years. Some of today’s well-established (but
unfortunately also commercial) data pipelining tools are InforSense KDE (InforSense
Ltd.), Insightful Miner (Insightful Corporation), and Pipeline Pilot (SciTegic). These 
environments allow the user to visually assemble and adapt the analysis flow from 
standardized building blocks, which are then connected through pipes carrying data 
or models. An additional advantage of these systems is the intuitive, graphical way 
to document what has been done. 

KNIME, the Konstanz Information Miner, provides such a pipelining environment.
Figure 1 shows a screenshot of an example analysis flow. In the center, a flow
reads in data from two sources and processes it in several parallel analysis flows,
consisting of preprocessing, modeling, and visualization nodes. On the left a
repository of nodes is shown. From this large variety of nodes, one can select data sources,
data preprocessing steps, model building algorithms, as well as visualization tools
and drag them onto the workbench, where they can be connected to other nodes. The
ability to have all views interact graphically (visual brushing) creates a powerful
environment to visually explore the data sets at hand. KNIME is written in Java and its
graphical workflow editor is implemented as an Eclipse (Eclipse Foundation (2007))
plug-in. It is easy to extend through an open API and a data abstraction framework,
which allows for new nodes to be quickly added in a well-defined way.



Fig. 1. An example analysis flow inside KNIME.

In this paper we describe some of the internals of KNIME in more detail. More 
information as well as downloads can be found at http://www.knime.org.



2 Architecture 

The architecture of KNIME was designed with three main principles in mind. 

• Visual, interactive framework: Data flows should be combined by simple 
drag&drop from a variety of processing units. Customized applications can be 
modeled through individual data pipelines. 

• Modularity: Processing units and data containers should not depend on each other
in order to enable easy distribution of computation and allow for independent
development of different algorithms. Data types are encapsulated, that is, no types
are predefined; new types can easily be added, bringing along type-specific
renderers and comparators. New types can be declared compatible with existing types.

• Easy expandability: It should be easy to add new processing nodes or views and
distribute them through a simple plugin mechanism without the need for
complicated install/uninstall procedures.
In order to achieve this, a data analysis process consists of a pipeline of nodes,
connected by edges that transport either data or models. Each node processes the
arriving data and/or model(s) and produces results on its outputs when requested.
Figure 2 schematically illustrates this process. The type of processing ranges from basic
data operations such as filtering or merging to simple statistical functions, such as
computations of mean, standard deviation or linear regression coefficients, to
computation-intensive data modeling operators (clustering, decision trees, neural networks,
to name just a few). In addition, most of the modeling nodes allow for an interactive
exploration of their results through accompanying views. In the following we will
briefly describe the underlying schemata of data, node, workflow management and
how the interactive views communicate.

2.1 Data structures 

All data flowing between nodes is wrapped within a class called DataTable, which 
holds meta-information concerning the type of its columns in addition to the actual 
data. The data can be accessed by iterating over instances of DataRow. Each row 
contains a unique identifier (or primary key) and a specific number of DataCell 
objects, which hold the actual data. The reason to avoid access by Row ID or index is 
scalability, that is, the desire to be able to process large amounts of data and therefore 
not be forced to keep all of the rows in memory for fast random access. KNIME 
employs a powerful caching strategy which moves parts of a data table to the hard 
drive if it becomes too large. Figure 3 shows a UML diagram of the main underlying 
data structure. 
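
The access pattern described above can be sketched as follows. This is a minimal illustration, not KNIME's Java API: the Python classes Row and Table, the column type names, and the helper column_mean are hypothetical stand-ins that only mirror the idea of a table carrying column meta-information and exposing its rows through iteration rather than random access.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Tuple


@dataclass(frozen=True)
class Row:
    """One row: a unique key plus a fixed number of cells."""
    key: str
    cells: Tuple[object, ...]


class Table:
    """Holds column meta-information and hands out rows only by iteration."""

    def __init__(self, column_types: Tuple[str, ...],
                 row_source: Callable[[], Iterator[Row]]):
        self.column_types = column_types
        self._row_source = row_source  # returns a fresh iterator on each call

    def __iter__(self) -> Iterator[Row]:
        return self._row_source()


def column_mean(table: Table, col: int) -> float:
    """Single pass over the rows; no row has to stay in memory."""
    total, n = 0.0, 0
    for row in table:
        total += row.cells[col]
        n += 1
    return total / n if n else float("nan")


def generate_rows() -> Iterator[Row]:
    # Rows are produced lazily; they could equally be read from disk.
    for i in range(1000):
        yield Row(key=f"Row{i}", cells=(float(i), "x"))


table = Table(column_types=("double", "string"), row_source=generate_rows)
mean = column_mean(table, 0)
```

Because consumers only iterate, the row source is free to keep part of the data on disk, which is the scalability argument made above.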

2.2 Nodes 

Nodes in KNIME are the most general processing units and usually correspond to one node
in the visual workflow representation. The class Node wraps all functionality and
makes use of user-defined implementations of a NodeModel, possibly a NodeDialog,
and one or more NodeView instances if appropriate. Neither a dialog nor a view needs to be
implemented if no user settings or views are needed. This schema follows the
well-known Model-View-Controller design pattern. In addition, for the input and output
connections, each node has a number of Inport and Outport instances, which can
either transport data or models. Figure 4 shows a UML diagram of this structure.



Fig. 2. A schematic for the flow of data and models in a KNIME workflow.

2.3 Workflow management 

Workflows in KNIME are essentially graphs connecting nodes, or more formally, a
directed acyclic graph (DAG). The WorkflowManager allows new nodes to be inserted and
directed edges (connections) to be added between two nodes. It also keeps track of the
status of nodes (configured, executed, ...) and returns, on demand, a pool of
executable nodes. This way the surrounding framework can freely distribute the
workload among a couple of parallel threads or - in the future - even a distributed cluster
of servers. Thanks to the underlying graph structure, the workflow manager is able
to determine all nodes required to be executed along the paths leading to the node
the user actually wants to execute.
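
The two graph queries mentioned above can be sketched in a few lines. The sketch assumes the workflow is given as a mapping from each node to its direct predecessors; the function names and the example node names are hypothetical and do not reflect the actual WorkflowManager interface.

```python
def required_nodes(predecessors, target):
    """All nodes on paths leading to target, including target itself."""
    needed, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(predecessors.get(node, ()))
    return needed


def executable_pool(predecessors, needed, executed):
    """Needed nodes that are not yet executed but whose inputs all are."""
    return {
        node for node in needed
        if node not in executed
        and all(p in executed for p in predecessors.get(node, ()))
    }


# Hypothetical workflow: two readers feed a joiner, which feeds a learner;
# a scatter plot hangs off one reader and is not needed for the learner.
predecessors = {
    "reader_a": [],
    "reader_b": [],
    "joiner": ["reader_a", "reader_b"],
    "tree_learner": ["joiner"],
    "scatter_plot": ["reader_b"],
}
needed = required_nodes(predecessors, "tree_learner")
first_pool = executable_pool(predecessors, needed, executed=set())
second_pool = executable_pool(predecessors, needed,
                              executed={"reader_a", "reader_b"})
```

All nodes in one pool are independent of each other, so a framework could hand them to parallel threads in any order.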




Fig. 3. A UML diagram of the data structure and the main classes it relies on.

Fig. 4. A UML diagram of the Node and the main classes it relies on.



2.4 Views and interactive brushing 

Each Node can have an arbitrary number of views associated with it. Through
receiving events from a HiLiteHandler (and sending events to it) it is possible to
mark selected points in such a view to enable visual brushing. Views can range from
simple table views to more complex views on the underlying data (e.g. scatterplots,
parallel coordinates) or the generated model (e.g. decision trees, rules).
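
The event exchange behind visual brushing can be pictured as a small observer arrangement. The following sketch is an assumption-laden illustration: HiliteHub and RecordingView are hypothetical Python classes, not the HiLiteHandler API, and a "view" here merely records which row keys it would draw as marked.

```python
class HiliteHub:
    """Central registry that forwards hilite events to all other listeners."""

    def __init__(self):
        self._listeners = []
        self.hilited = set()

    def register(self, listener):
        self._listeners.append(listener)

    def hilite(self, row_keys, source=None):
        new = set(row_keys) - self.hilited
        if not new:
            return  # nothing changed, so no event is sent
        self.hilited |= new
        for listener in self._listeners:
            if listener is not source:
                listener.on_hilite(new)


class RecordingView:
    """Stand-in for a view: remembers the row keys it should mark."""

    def __init__(self, name, hub):
        self.name = name
        self.hub = hub
        self.marked = set()
        hub.register(self)

    def select(self, row_keys):
        """User selects points in this view; other views are notified."""
        self.marked |= set(row_keys)
        self.hub.hilite(row_keys, source=self)

    def on_hilite(self, row_keys):
        self.marked |= row_keys


hub = HiliteHub()
scatter = RecordingView("scatter", hub)
table_view = RecordingView("table", hub)
scatter.select({"Row3", "Row7"})
```

After the selection in the scatter view, the table view marks the same rows, because both views refer to rows by their unique keys.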

2.5 Meta nodes 

So-called Meta Nodes wrap a sub workflow into an encapsulating special node. This
provides a series of advantages, such as enabling the user to design much larger,
more complex workflows and to encapsulate specific actions. To this end some
customized meta nodes are available, which allow for a repeated execution of the
enclosed sub workflow, offering the ability to model more complex scenarios such as
cross-validation, bagging and boosting, ensemble learning etc. Due to the modularity
of KNIME, these techniques can then be applied to virtually any (learning) algorithm
available in the repository.

Additionally, the concept of Meta Nodes helps to assign dedicated servers to this 
subflow or export the wrapped flow to other users as a predefined module. 
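
Repeated execution of an enclosed sub workflow can be sketched for the cross-validation case. The sketch assumes the sub workflow is simply a function that takes training and test rows and returns a score; cross_validate and majority_class_workflow are hypothetical names, and the majority-class learner stands in for any learning algorithm.

```python
def cross_validate(rows, k, sub_workflow):
    """Run sub_workflow once per fold and return the mean score."""
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(sub_workflow(train, test))
    return sum(scores) / k


def majority_class_workflow(train, test):
    """Placeholder learner: predict the most frequent training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, label in test if label == majority) / len(test)


# 20 toy records; every fourth one belongs to the minority class.
data = [(x, "pos" if x % 4 else "neg") for x in range(20)]
mean_accuracy = cross_validate(data, k=5, sub_workflow=majority_class_workflow)
```

The outer loop knows nothing about the learner it wraps, which is why the same wrapper can be reused for any algorithm that fits the train/test interface.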

2.6 Distributed processing 

Due to the modular architecture it is easy to designate specific nodes to be run on
separate machines. But to accommodate the increasing availability of multi-core
machines, the support for shared memory parallelism also becomes increasingly
important. KNIME offers a unified framework to parallelize data-parallel operations. Sieb
et al. (2007) describe further extensions, which enable the distribution of complex
tasks such as cross validation on a cluster or a GRID.



3 Repository 

KNIME already offers a large variety of nodes, among them nodes for various
types of data I/O, manipulation, and transformation, as well as data mining and
machine learning and a number of visualization components. Most of these nodes have
been specifically developed for KNIME to enable tight integration with the
framework; other nodes are wrappers, which integrate functionality from third party
libraries. Some of these are summarized in the following subsections.

3.1 Standard nodes 

• Data I/O: generic file reader, reader for the attribute-relation file format
(ARFF), database connector, CSV and ARFF writer, Excel spreadsheet writer

• Data manipulation: row and column filtering, data partitioning and sampling,
sorting or random shuffling, data joiner and merger

• Data transformation: missing value replacer, matrix transposer, binners, nominal
value generators

• Mining algorithms: clustering (k-means, SOTA, fuzzy c-means), decision tree,
(fuzzy) rule induction, regression, subgroup and association rule mining, neural
networks (probabilistic neural networks and multi-layer perceptrons)

• Visualization: scatter plot, histogram, parallel coordinates, multidimensional
scaling, rule plotters

• Misc: scripting nodes 



3.2 External tools 

KNIME integrates functionality of different open source projects that essentially cover
all major areas of data analysis, such as WEKA (Witten and Frank (2005)) for
machine learning and data mining, the R environment (R Development Core Team (2007))
for statistical computations and graphics, and JFreeChart (Gilbert (2005)) for
visualization.

• WEKA: essentially all algorithm implementations, for instance support vector 
machines, Bayes networks and Bayes classifier, decision tree learners 

• R-project: console node to interactively execute R commands, basic R plotting 
node 

• JFreeChart: various line, pie and histogram charts 

The integration of these tools not only enriches the functionality available in
KNIME but has also proven helpful in overcoming compatibility limitations when
these different libraries are to be used in a shared setup.


4 Extending KNIME

KNIME already includes plug-ins to incorporate existing data analysis tools. It is
usually straightforward to create wrappers for external tools without having to modify
these executables themselves. Adding new nodes to KNIME, also for native new
operations, is easy. For this, one needs to extend three abstract classes:

• NodeModel: this class is responsible for the main computations. It requires
three main methods to be overridden: configure(), execute(), and reset(). The
first takes the meta information of the input tables and creates the definition of
the output specification. The execute function performs the actual creation of
the output data or models, and reset discards all intermediate results.

• NodeDialog: this class is used to specify the dialog that enables the user to
adjust individual settings that affect the node’s execution. A standardized set of
DefaultDialogComponent objects allows the node developer to quickly create
dialogs when only a few standard settings are needed.

• NodeView: this class can be extended multiple times to allow for different views
onto the underlying model. Each view is automatically registered with a
HiLiteHandler, which sends events when other views have hilited points and
allows events to be launched when points have been hilited inside this view.

In addition to the three model, dialog, and view classes the programmer also needs to 
provide a NodeFactory, which serves to create new instances of the above classes. 
The factory also provides names and other details such as the number of available 
views or a flag indicating absence or presence of a dialog. 

A wizard integrated in the Eclipse-based development environment enables
convenient generation of all required class bodies for a new node.
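
The division of labour between configure(), execute(), and reset() can be illustrated with a language-neutral sketch. KNIME nodes are written in Java against the KNIME API; the Python classes below (NodeModelSketch, ColumnFilterSketch) are hypothetical and only mirror the three responsibilities named above, with table specifications reduced to dictionaries mapping column names to type names.

```python
from abc import ABC, abstractmethod


class NodeModelSketch(ABC):
    """Hypothetical analogue of a node model with its three main methods."""

    @abstractmethod
    def configure(self, in_specs):
        """Derive the output specification from the input specifications."""

    @abstractmethod
    def execute(self, in_tables):
        """Create the actual output data."""

    @abstractmethod
    def reset(self):
        """Discard all intermediate results."""


class ColumnFilterSketch(NodeModelSketch):
    """Keeps only the configured columns of its single input table."""

    def __init__(self, keep):
        self.keep = list(keep)
        self.rows_seen = 0  # intermediate result, cleared by reset()

    def configure(self, in_specs):
        spec = in_specs[0]
        missing = [c for c in self.keep if c not in spec]
        if missing:
            raise ValueError(f"unknown columns: {missing}")
        return [{c: spec[c] for c in self.keep}]

    def execute(self, in_tables):
        out = [{c: row[c] for c in self.keep} for row in in_tables[0]]
        self.rows_seen = len(out)
        return [out]

    def reset(self):
        self.rows_seen = 0


node = ColumnFilterSketch(keep=["age"])
out_spec = node.configure([{"age": "int", "name": "string"}])
out_table = node.execute([[{"age": 31, "name": "a"}, {"age": 45, "name": "b"}]])
```

Separating configure() from execute() lets downstream nodes learn the output structure before any data has been processed, so configuration errors surface early.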



5 Conclusion 

KNIME, the Konstanz Information Miner, offers a modular framework, which
provides a graphical workbench for visual assembly and interactive execution of data
pipelines. It features a powerful and intuitive user interface, enables easy integration
of new modules or nodes, and allows for interactive exploration of analysis results or
trained models. In conjunction with the integration of powerful libraries such as the
WEKA data mining toolkit and the R statistics software, it constitutes a feature-rich
platform for various data analysis tasks.

KNIME is an open source project available at http://www.knime.org. The current 
release version 1.2.1 (as of 14 May 2007) has numerous improvements over the first 
public version released in July 2006. KNIME is actively maintained by a group of 
about 10 people and has more than 6000 downloads so far. It is free for non-profit 
and academic use. 
Abstract. Most data mining systems follow a data flow and toolbox paradigm. While this
modular approach delivers ultimate flexibility, it gives the user almost no guidance on the issue
of choosing an efficient combination of algorithms in the current problem context. In the field
of Software Engineering the Pattern Based development process has empirically proven its
high potential. Patterns provide a broad and generic framework for the solution process in its
entirety and are based on equally broad characteristics of the problem. Details of the individual
steps are filled in at later stages. Basic research on pattern based thinking has provided us with
a list of generally applicable and proven patterns. User interaction in a pattern based approach
to data mining will be divided into two steps: (1) choosing a pattern from a generic list based
on a handful of characteristics of the problem and later (2) filling in data mining algorithms
for the subtasks.



1 Current situation in data mining 

The current situation in the data mining area is characterized by a plethora of
algorithms and variants. The well known WEKA collection (Witten and Frank (2005))
implements approx. 100 different algorithms. However, there is little guidance in
selecting and using the appropriate algorithm for the problem at hand, as each algorithm
may also have its very specific strengths and weaknesses.

As Figure 1 shows for large German companies, the most significant problems in
data mining are application issues and the management of the process as a whole
and not the lack of algorithms (Hippner, Merzenich and Stolz (2002)). Standardizing
the process as proposed by Fayyad et al. (1996) and later refined into the CRISP-DM
model (Chapman et al. (2000)) has resulted in a well established phase model
with preprocessing, mining and postprocessing steps, but has failed to give hints for
choosing a proper sequence of processing tools or avoiding pitfalls.

Design always has elements of both integrated and modular solutions. Integrated
solutions provide us with simplicity but lack adaptability. Modular solutions give
us the ability to have greater influence on our solution, but ask for more knowledge
and human attendance. In reality all solutions lie between full modularity and full
integrality (Eckert and Clarkson (2005)). We believe that for solving problems in the
data mining area a modular solution is more appropriate than an integrated
one.

Boris Delibasic, Kathrin Kirchner and Johannes Ruhland

[Figure 1 bar labels: more staff in data mining area; training for better understanding of
the complex analysis process; training for better usage of analysis software; better software
in the field of analysis tools; better software in the field of databases; better hardware
for usage of analysis tools]

Fig. 1. Proposals for improvement of current data mining projects (46 questionnaires, average
scores, 1 = no improvement, 5 = highly improvable)

Patterns are meant to be experience packages that give a broad outline on how 
to solve specific aspects of complex problems. Complete solutions are built through 
chaining and nesting of patterns. Thus they go beyond the pure structuring goal. They 
have proven their potential in diverse fields of science. 



2 Introduction to patterns 

Patterns are already very popular in software design, as the well known GOF patterns
for Object Oriented Design exemplify (Gamma et al. (1995)). The patterns we envisage
are, however, applicable to a much wider context. With the development of
pattern theories in various areas (architecture, IS, telecommunications, organization)
it seems that the problems of adaptability and maintenance of DM algorithms
can also be solved using patterns.

The protagonist of the pattern movement, Christopher Alexander, defines a pattern
as a three-part rule that expresses the relation between a certain context, a problem
and a solution. It is at the same time a thing that happens in the world (empirics)
and a rule that tells us how to create that thing (process rule) and when to create it
(context specificity). It is at the same time a process, a description of a thing that is
alive, and a process that generates that thing (Alexander (1979)). Alexander’s work
concentrated on identifying patterns in architecture, covering a broad range from
urban planning to details of interior design. The patterns are shells, which allow
various realizations, all of which will solve the problem.

Fig. 2. Small public square pattern



We shall illustrate the essence and power of C. Alexander style patterns by two
examples. In Figure 2 a pattern named Small public squares is presented. Such
squares enable people in large cities to gather, communicate and develop a
community feeling. The core of the pattern is to make such squares not too large, lest they
be deserted and look strange to people.

Another example is shown in Figure 3. The pattern Entrance transition
advocates and enables a smooth transition between the outdoor and indoor space in a
house. People do not like instant transition. It makes them feel uncomfortable, and
the house ugly.




Fig. 3. Entrance transition pattern 



Alexander (2002b) says: 

1. Patterns contain life. 

2. Patterns support each other: the life and existence of one pattern influences the 
life and existence of another pattern. 

3. Patterns are built of patterns, this way their composition can be explained. 

4. The whole (the space in which patterns are implemented) gets its life
depending on the density and intensity of the patterns inside the whole.

We want to provide the user with the ability to build data mining (DM) solutions
by nesting and pipelining of patterns. In that way, the user will concentrate on the
problems he wants to solve through the deployment of some key patterns. He may
then nest patterns deep enough to get the job done at the data processing level.
Current DM algorithms and DM process paradigms don’t provide users with such an
ability, as they are typically based on the data flow diagram approach. A
standard problem solution in the SPSS Clementine system is shown in Figure 4; it
is a documentation of a chosen solution rather than a solution guide.







3 Some data mining patterns 

We have already developed some archetypical DM patterns. For their formal
representation the J.O. Coplien Pattern Formalization Form has been used (Coplien and
Zhao (2005), Coplien (1996), p. 8). This form consists of the following elements:
Context, Problem, Forces, Solutions and Resulting context.

A pattern is applicable within a Context (description of the world) and creates a 
Resulting Context, as the application of the Pattern will change the state space. 

Problem describes what produces the uncomfortable feeling in a certain situation.
Forces are keys for pattern understanding. Each force will yield a quality criterion
for any solution, and as forces can be (and generally are) conflicting, the relative
importance of forces will drive a good solution into certain areas of the solution
space, hence their name. In many contexts, for instance, the relative importance of
the conflicting forces of economic, time and quality considerations will render a
particular solution a good or a bad compromise.

When a problem and its forces, as problem descriptors, are well understood, then a
solution is most often easily evaluated. Understanding of a problem is crucial for finding
a solution. Patterns are functions that transform problems in a certain context into
solutions. Patterns are familiar and popular concepts, because they systematize
recurring solutions in nature. The solution, the pattern itself, resolves forces
in a problem and provides a good solution. On the other hand, a pattern is always a
compromise, and it is not easy to recognize. Because it is a compromise it resolves some
forces, but may add new ones to the context space.

A pattern is best recognized through solving and generalizing real problems. The
quality and applicability of patterns may change over time as new forces gain
relevance or new solutions become available. The process of recognizing and deploying
patterns is continuous. For example, house building changed very much when
concrete was invented.

3.1 The Condense Pattern: a popular DM pattern 

The pattern in the Coplien form is:

1. Context: The collection of data is completed.

2. Problem: Data matrix is too large for efficient handling.

3. Selected forces: Efficiency of DM algorithms depends upon the number of cases
and variables. Irrelevant cases and variables will hamper learning capabilities of
the DM algorithm. Leaving out a case or a variable may lead to errors and delete
special, but important cases.

4. Solution: Condense the data matrix

5. Resulting context: manageable data matrix with some information loss



The Condense pattern is a typical preprocessing pattern that has found diverse
applications, for example on variables (by, for example, calculating a score, choosing
a representative variable or clustering of variables), on cases (through
sampling, clustering with subsequent use of centers only, etc.) or in the transformation of
continuous variables (e.g. through equal width binning, equal frequency binning).
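
The two binning variants named above can be sketched in a few lines. The function names and the toy income values are illustrative assumptions; the sketch returns, for each value, the index of the bin it falls into.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum would land in bin k, so it is clamped into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign values to k bins holding (nearly) the same number of cases.

    Bins are formed by rank, so tied values may end up in different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


incomes = [10, 12, 13, 15, 18, 20, 95, 100]
by_width = equal_width_bins(incomes, 2)
by_frequency = equal_frequency_bins(incomes, 2)
```

With the two outliers in the toy data, equal width binning puts six of the eight cases into the first bin, whereas equal frequency binning splits them four and four. Both condense the variable, but they lose different information.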

3.2 The Divide et Impera Pattern 

A second pattern which is widely used in data mining is Divide et Impera. It can also 
be described in the Coplien pattern form: 

1. Context: A data mining problem is too large/complicated to be solved in one
step.

2. Problem: Structuring of the task

3. Forces: It is not always possible to subdivide the problem, as there may be many
strongly interrelated facets influencing it. The mere combination of subproblem
solutions may be grossly suboptimal. Subproblems may have very different
relevance for the global problem. Complexity of a generated subproblem may be
grossly out of proportion to its relevance.

4. Solution: Divide the problem into subproblems that are more easily solved (and
quite often structurally similar to the original one) and build the solution to the
complete problem as a combination.

5. Resulting context: a set of smaller problems, more amenable to solution

- It is possible that the problem structure is bad.

- The effort may not have been reduced in sum.

The Divide et Impera pattern can be used for problem structuring where the
problem is too complex to be solved in one step. It is found as a typical meta heuristic in
many algorithms such as decision trees or divisive clustering. Other application
areas, which also vouch for its broad applicability, are segmented marketing (if an
across-the-board marketing strategy is not feasible, try to form homogeneous
segments and cater to their needs), or the division of labor within divisional
organizations.
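
The pattern's structure, with its separate slots for dividing, solving, and combining, can be written down as a generic function. The sketch below is an illustration under stated assumptions: divide_et_impera and largest_gap_split are hypothetical names, and the example fills the slots with a very simple divisive clustering of one-dimensional values that splits at the largest gap until every group spans at most 1.0.

```python
def divide_et_impera(problem, is_simple, solve, divide, combine):
    """Solve a problem directly if it is simple, else split and recombine."""
    if is_simple(problem):
        return solve(problem)
    parts = divide(problem)
    return combine([
        divide_et_impera(part, is_simple, solve, divide, combine)
        for part in parts
    ])


def largest_gap_split(values):
    """Split sorted values into two non-empty parts at the widest gap."""
    values = sorted(values)
    gaps = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [values[:cut], values[cut:]]


clusters = divide_et_impera(
    problem=[1.0, 1.2, 0.9, 8.0, 8.3, 20.0],
    is_simple=lambda vs: max(vs) - min(vs) <= 1.0,   # group is compact enough
    solve=lambda vs: [sorted(vs)],                   # one cluster
    divide=largest_gap_split,
    combine=lambda parts: [c for part in parts for c in part],
)
```

Only the four slot functions are specific to clustering; replacing them yields, for example, a decision tree learner or a segmentation scheme with the same overall control flow.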

3.3 More patterns in data mining 

We have already identified many other patterns in the field of data mining. Some of
them are:

• Combine voting (with boosting, bagging, stacking, etc. as corresponding
algorithms),

• Training / Retraining (supervised mining, etc.),

• Solution analysis,

• Categorization and

• Normalization

This list is in no way closed. Every area of human interest has its characteristic
patterns. However, the number of patterns is not infinite, but always limited.
Collecting them and making them available gives users the
possibility to model the DM process, but also to understand the DM process through
patterns.



4 Summary and outlook 

Pattern based data mining offers some attractive features:

1. The algorithm creators and the algorithm users have different interests and
different needs. These sides often don’t understand each other’s needs and, quite
often, do not need to know about specific details relevant to the other side. A
pattern is something that is understandable to all people.

2. Science converges. Concepts in one area of science are applicable in another area.
Patterns support these processes. This potential is comparable to the promises of
Systems Theory.

3. Decision for a specific algorithm can be postponed to later stages. A solution
path as a whole will be sketched through patterns and algorithms need only be
filled in immediately prior to processing. Using different algorithms in places
will not invalidate the solution path, creating “late binding” at the algorithm
level.

Current Data Mining applications occasionally provide the user with first traces
of pattern based DM. Figure 5 shows the example of Bagging of Classifiers within
the TANAGRA project and its graphical user interface (Rakotomalala (2004)).
Bagging cannot be described with a pure data flow paradigm; rather, a nesting of a
classifier pattern within the bagging pattern is needed. This nested structure will then be
pipelined with pre- and postprocessing patterns.
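
The nesting described above can be made concrete with a short sketch in which the bagging pattern receives the classifier pattern as a parameter. All names are hypothetical, the threshold "stump" stands in for any classifier, and redrawing bootstrap samples that contain a single class is a simplification for the toy data, not part of bagging proper.

```python
import random
from collections import Counter


def train_stump(sample):
    """Inner pattern: threshold halfway between the two class means.

    Assumes one numeric feature, labels 0/1, and class 1 lying above class 0.
    """
    pos = [x for x, label in sample if label == 1]
    neg = [x for x, label in sample if label == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0


def bootstrap_sample(data, rng):
    """Draw len(data) records with replacement; redraw if one class is missing."""
    while True:
        sample = [rng.choice(data) for _ in data]
        if len({label for _, label in sample}) > 1:
            return sample


def bagging(train, data, n_models, rng):
    """Outer pattern: train the nested classifier on bootstrap samples and vote."""
    models = [train(bootstrap_sample(data, rng)) for _ in range(n_models)]

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


data = [(x, 0) for x in (1.0, 1.5, 2.0, 2.5)] + \
       [(x, 1) for x in (7.0, 7.5, 8.0, 8.5)]
ensemble = bagging(train_stump, data, n_models=25, rng=random.Random(0))
predictions = [ensemble(0.5), ensemble(9.0)]
```

Because the classifier is passed in as an argument, any other training function with the same interface can be nested without touching the bagging code, which corresponds to the late binding at the algorithm level mentioned above.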




Fig. 5. Screenshot of Tanagra Software 



Further steps in our project are to

• collect a list of patterns which are useful in the whole knowledge discovery
process and data mining (the list will be open-ended).

• integrate these patterns into data mining software to help design ad-hoc
algorithms, choose an existing one or provide guidance in the data mining process.

• develop a software prototype with our patterns and run experiments with users
to learn how it works and what the benefits are.
Abstract. Entity identification deals with matching records from different datasets or within
one dataset that represent the same real-world entity when unique identifiers are not available.
Enabling data integration at record level as well as the detection of duplicates, entity
identification plays a major role in data preprocessing, especially concerning data quality. This paper
presents a framework for statistical entity identification, in particular focusing on probabilistic
record linkage and string matching, and its implementation in R. According to the stages of
the entity identification process, the framework is structured into seven core components: data
preparation, candidate selection, comparison, scoring, classification, decision, and evaluation.
Samples of real-world CRM datasets serve as illustrative examples.



1 Introduction 

Ensuring data quality is a crucial challenge in statistical data management aiming at 
improved usability and reliability of the data. Entity identification deals with match- 
ing records from different datasets or within a single dataset that represent the same 
real-world entity and, thus, enables data integration at record level as well as the 
detection of duplicates. Both can be regarded as a means of improving data qual- 
ity, the former by completing datasets through adding supplementary variables, re- 
placing missing or invalid values, and appending records for additional real-world 
entities, the latter by resolving data inconsistencies. Unless sophisticated methods 
are applied, data integration is also a potential source of ‘dirty’ data: duplicate or 
incomplete records might be introduced. Besides its contribution to data quality, en- 
tity identification is regarded as a means of increasing the efficiency of the usage 
of available data as well. This is of particular interest in official statistics, where 
the reduction of the responder burden is a prevailing issue. In general, applications 
necessitating statistical entity identification (SEI) are found in diverse fields such as 
data mining, customer relationship management (CRM), bioinformatics, criminal in- 
vestigations, and official statistics. Various frameworks for entity identification have 
been proposed (see for example Denk (2006) or Neiling (2004) for an overview), 
most of them concentrating on particular stages of the process, such as the author’s
SAS implementation of a metadata framework for record linkage procedures (Denk
(2002)). Moreover, commercial as well as ‘governmental’ software (especially from
national statistical institutes) is available (for a survey cf. Herzog et al. (2007) or Gill
(2001)).

Based on the insights gained from the EU FP5 research project DIECOFIS (Denk 
et al. (2004 & 2005)) in the context of the integration of enterprise data sources, a 
framework for statistical entity identification has been designed (Denk (2006)) and 
implemented in the free software environment for statistical computing R (R Development
Core Team (2006)). Section 2 provides an overview of the underlying
methodological framework, Section 3 introduces its implementation. The function- 
ality of the framework components is discussed and illustrated by means of demo 
samples of real-world CRM data. Section 4 concludes with a short summary and an 
outlook on future work. 



2 Methodological framework 

Statistical entity identification aims at finding a classification rule assigning each 
pair of records from the original dataset(s) to the set of links (identical entities or 
duplicates) or the set of non-links (distinct entities), respectively. Frequently, a third 
class is introduced containing undetermined record pairs (possible links/duplicates) 
for which the final linkage status can only be set by using supplementary information
(usually obtained via clerical review). The process of deriving such a classification 
rule can be structured into seven stages (Denk (2006)). In the initial data preparation
stage, matching variables are defined and undergo various transformations to
become suitable for use in the ensuing processing stages. In particular, string
variables have to be preprocessed to become comparable among datasets (Winkler 
(1994)). In the candidate selection or filtering stage, candidate record pairs with a 
higher likelihood of representing identical real-world entities are selected (Baxter 
et al. (2003)), since a detailed comparison, scoring, and classification of all possi- 
ble record pairs from the cross product of the original datasets is extremely time- 
consuming (if accomplishable at all). In the third stage, the comparison or profiling 
stage, similarity profiles are determined which consist of compliance measures of the 
records in a candidate pair with respect to the specified matching variables, in which
the treatment of string variables (Navarro (2001)) and missing values is the most 
challenging (Neiling (2004)). Based on the similarity patterns, the scoring stage esti- 
mates matching scores for the candidate record pairs. In general, matching scores are 
defined as ratios of the conditional probabilities of observing a particular similarity 
pattern provided that the record pair is a true match or non-match respectively, or as 
the binary or natural logarithm thereof (Fellegi and Sunter (1969)). The conditional 
probabilities are estimated via the classical EM algorithm (Dempster et al. (1977)) 
or one of the problem-specific EM variants (Winkler (1994)). In the ensuing classification
stage, classification rules are determined. Especially in the record linkage
setting, rules are based on prespecified error levels for erroneous links and non-links 
through two score thresholds that can be directly obtained from the estimated conditional
probabilities (Fellegi and Sunter (1969)) or via comparable training data with
known true matching status (Belin and Rubin (1995)). In the decision stage, exam- 
ined record pairs are finally assigned to the set of links or non-links and inconsistent 
values of linked records with respect to common variables are resolved. If 1:n or 1:1
assignment of records is targeted, the m:n assignment resulting from the classifica- 
tion stage has to be refined (Jaro (1989)). The seventh and final stage focuses on 
the evaluation of the entity identification process. Training data (e.g. from previous 
studies or from a sample for which the true matching status has been determined) 
are required to provide sound estimates of quality measures. A contingency table of 
the true versus the estimated matching status is used as a basis for the calculation of 
misclassification rates and other overall quality criteria, such as precision and recall, 
which can be visualized for varying score thresholds. 
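
As a minimal sketch of the matching score described in the scoring stage, the following Python fragment computes the binary logarithm of the likelihood ratio for a binary similarity pattern. It is an illustration only and not part of the R framework: the function name and the probabilities are hypothetical, and conditional independence of the matching variables given the matching status is assumed here although it is not stated explicitly above.

```python
import math

def match_weight(pattern, m, u):
    """Binary logarithm of the likelihood ratio of a similarity pattern.

    pattern: 0/1 agreement indicators, one per matching variable
    m[j]:    P(agreement on variable j | true match)      (assumed values)
    u[j]:    P(agreement on variable j | true non-match)  (assumed values)
    Conditional independence of the variables is an assumption of this sketch.
    """
    weight = 0.0
    for agree, m_j, u_j in zip(pattern, m, u):
        if agree:
            weight += math.log2(m_j / u_j)              # agreement weight
        else:
            weight += math.log2((1 - m_j) / (1 - u_j))  # disagreement weight
    return weight

# Hypothetical probabilities for three matching variables
m = [0.9, 0.8, 0.95]
u = [0.1, 0.2, 0.05]
print(match_weight([1, 1, 1], m, u))  # positive: evidence for a match
print(match_weight([0, 0, 0], m, u))  # negative: evidence for a non-match
```

Full agreement yields a large positive weight and full disagreement a large negative one; the two score thresholds of the classification stage are then placed on this scale.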



3 Implementation 

The SEI framework is structured according to the seven stages of the statistical en- 
tity identification process. For each stage there is one component, i.e. one function, 
that establishes an interface to the lower-level functions which implement the respec- 
tive methods. The outcome of each stage is a list containing the processed data and 
protocols of the completed processing stages. Table 1 provides an overview of the 
functionality of the components and the spectrum of available methods. Methods not 
yet implemented are italicised. 

3.1 Sample data 

As an illustrative example, samples of real-life CRM datasets are used originating 
from a register of casino customers and their visits (approx. 150,000) and a survey 
on customer satisfaction. Common (and thus potential matching) variables are first 
and last name, sex, age group, country, region, and five variables related to previous 
visits and the playing behaviour of the customers (visit1, visit2, visit3, visit4, and
lottery). The demo datasets correspond to a sample of 100 survey records for which 
the visitor ID is also known and 100 register entries from which 70 match the survey 
sample and the remaining 30 were drawn at random. I.e., the true matching status of 
all 10,000 record pairs is known. The data snippet shows a small subset of the first
dataset.

        fname    sex  agegroup  country  visit1  visit2
711     GERALD   m    41-50     Austria  1       1
13      PAOLO    m    41-50     Italy    1       1
164988  WEIFENG  m    19-30     other    0       1



3.2 Data preparation 

Table 1. Component Functionality and Methodological Range

Component       Functionality        Methods

Preparation     parsing              address and name parsing in different languages
                standardisation      dictionary provided by the user
                                     integrated dictionaries
                phonetic coding      American Soundex, Original Russell Soundex
                                     NYSIIS, ONCA, Daitch-Mokotoff,
                                     Koelner Phonetik, Reth-Schek-Phonetik
                                     (Double) Metaphone, Phonex, Phonet, Henry
Filtering       single-pass          cross product / no selection, blocking,
                                     sorted neighbourhood, string ranking
                                     hybrid
                multi-pass           sequence of single-pass
Comparison      universal            binary, frequency-based
                metric variables     tolerance intervals, (absolute distance)^, Canberra
                string variables:    see above
                phonetic coding
                string variables:    Jaccard, n-gram, maximal match,
                token-based          longest common subsequence, TF-IDF
                string variables:    Damerau-Levenshtein, Hamming, Needleman-Wunsch,
                edit distances       Monge-Elkan, Smith-Waterman
                string variables:    Jaro, Jaro-Winkler
                Jaro algorithms      Jaro-McLaughlin, Jaro-Lynch
Scoring         binary outcomes      two-class EM
                                     two-class EM interactions, three-class EM
                frequency based      Fellegi-Sunter, two-class EM frequency based
                similarities         two-class EM approximate
                any                  logistic regression
Classification  no training data     Fellegi-Sunter empiric, Fellegi-Sunter pattern
                training data        Belin-Rubin
Decision        assignment           greedy
                                     LSAP
                review               possible links, inconsistent values
Evaluation      confusion matrix     absolute, relative
                quality measures     false match rate Fellegi-Sunter & Belin-Rubin,
                                     false non-match rate Fellegi-Sunter & Belin-Rubin,
                                     accuracy, precision, recall, f-measure, specificity,
                                     unclassified pairs
                plots                varying classification rules

preparation(data, variable, method, label, ...) provides an interface to
the phoncode() function from the StringMatch toolbox (Denk (2007)) as well as
the functions standardise() and parse(). By this means, preparation() phonetically
codes, standardises, or parses the variable(s) in data frame data according
to the specified method(s) (default: American Soundex ('asoundex')) and appends
the resulting variable(s) with the defined label(s) to the data. The default
label is composed of the specified variables and methods. At the moment, a selection
of popular phonetic coding algorithms and standardisation with user-provided
dictionaries are implemented, whereas parsing is not yet supported. The ellipsis
indicates additional method-specific arguments, e.g., the dictionary according to which
standardisation should be carried out. The following code chunk illustrates the usage
of the function.

> preparation(data=d1, variable='lname',
              method='asoundex')

       ... lname       ... asoundex.lname
115256 ... WESTERHEIDE ... W236
200001 ... BESTEWEIDE  ... B233
200002 ... WESTERWELLE ... W236
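
To make the phonetic coding step concrete, here is a minimal Python sketch of American Soundex. It is an independent illustration of the standard coding rules and is not the phoncode() function of the StringMatch toolbox; the function name american_soundex is hypothetical.

```python
# Letter groups of American Soundex; vowels, H, W and Y carry no code.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit

def american_soundex(name):
    """Return the four-character American Soundex code of a name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets the previous code
    return (first + "".join(digits) + "000")[:4]

for name in ["WESTERHEIDE", "BESTEWEIDE", "WESTERWELLE"]:
    print(name, american_soundex(name))
```

For the three last names shown in the output above, the sketch yields the same codes (W236, B233, W236), which illustrates why blocking on the Soundex code would group WESTERHEIDE with WESTERWELLE but not with BESTEWEIDE.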

3.3 Candidate selection 

candidates(data1, data2, method, selvars1, selvars2, key1, key2,
...) provides an interface to the functions crossproduct(), blocking(),
sortedneighbour(), and stringranking(). Candidate record pairs from data
frames data1 and data2 are created and filtered according to the specified method
(default: 'blocking'). In case of a deduplication scenario, data2 does not have to
be specified. selvars1 and selvars2 specify the variables that the filtering is based
on. The ellipsis indicates additional method-specific arguments, e.g. the extent k of 
the neighbourhood for sorted neighbourhood filtering or the string similarity mea- 
sure to be used for string ranking. The following examples illustrate the usage of 
the function. In contrast to the full cross product of the datasets with 10,000 record 
pairs, sorted neighbourhood by region, age group, and sex reduces the list of can- 
didate pairs to 1,024, and blocking by Soundex code of last name retains only 83 
candidates. 

> candidates(data1=d1.prep, data2=d2.prep,
             method='blocking', selvars1='asoundex.lname')

> candidates(data1=d1.prep, data2=d2.prep,
             method='sorted', selvars1=c('region', 'agegroup', 'sex'), k=10)
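
The difference between the filtering methods can be sketched in a few lines of Python. The functions below are hypothetical stand-ins for crossproduct(), blocking(), and sortedneighbour(), written under the assumption that each record is represented by a single sortable key; they illustrate the general techniques, not the framework's implementation.

```python
from collections import defaultdict
from itertools import product

def cross_product(keys1, keys2):
    """All record pairs (no selection)."""
    return set(product(range(len(keys1)), range(len(keys2))))

def blocking(keys1, keys2):
    """Pairs of records whose blocking keys are identical."""
    index = defaultdict(list)
    for j, key in enumerate(keys2):
        index[key].append(j)
    return {(i, j) for i, key in enumerate(keys1) for j in index.get(key, [])}

def sorted_neighbourhood(keys1, keys2, k):
    """Pairs from different datasets that fall into a sliding window of
    k consecutive records after sorting the pooled records by key."""
    pooled = sorted([(key, 0, i) for i, key in enumerate(keys1)] +
                    [(key, 1, j) for j, key in enumerate(keys2)])
    pairs = set()
    for pos, (_, src_a, idx_a) in enumerate(pooled):
        for _, src_b, idx_b in pooled[pos + 1:pos + k]:
            if src_a != src_b:
                # store pairs as (index in dataset 1, index in dataset 2)
                pairs.add((idx_a, idx_b) if src_a == 0 else (idx_b, idx_a))
    return pairs

keys1 = ["W236", "B233", "M460"]   # toy Soundex codes, dataset 1
keys2 = ["W236", "W236", "K530"]   # toy Soundex codes, dataset 2
print(len(cross_product(keys1, keys2)))
print(sorted(blocking(keys1, keys2)))
print(sorted(sorted_neighbourhood(keys1, keys2, k=2)))
```

Blocking keeps only pairs with identical keys, whereas sorted neighbourhood also admits pairs with adjacent but different keys, so it tolerates small errors in the key at the price of a larger candidate list.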

3.4 Comparison 

comparison(data, matchvar1, matchvar2, method, label, ...) makes
use of the stringsim() function from the StringMatch toolbox (Denk (2007))
as well as the functions simplecomp() for simple (dis-)agreement and metcomp()
for similarities of metric variables. comparison() computes the similarity profiles
for the candidate pairs in data frame data with respect to the specified matching
variable(s) matchvar1, matchvar2 according to the selected method and appends
the resulting variable(s) with the defined label(s) to data. The ellipsis indicates ad- 
ditional method-specific arguments, e.g. different types of weights for Jaro or edit 
distance algorithms. In the current implementation, missing values are not specially 
treated. 
> comparison(data=d12, matchvar1=c('fname.d1', 'lname.d1', 'visit1.d1'),
             matchvar2=c('fname.d2', 'lname.d2', 'visit1.d2'),
             method=c('jaro', 'asoundex', 'simple'))

  fname.d1 fname.d2 jaro.fname c.asound.lname simple.visit1
1 GERALD   SELJAMI  0.53175    0.00000        0.00000
2 PAOLO    SELJAMI  0.39524    0.00000        0.00000
3 WEIFENG  SELJAMI  0.42857    0.00000        1.00000
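
As an illustration of one of the string comparators, the following Python sketch implements the Jaro similarity in its commonly cited form, with a matching window of half the longer string length minus one. It is not the stringsim() function of the StringMatch toolbox, which may differ in details such as the matching window or weights, so it need not reproduce every value in the output above.

```python
def jaro(s, t):
    """Jaro similarity of two strings (standard definition)."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match if equal and no further apart than the window
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order
    s_seq = [ch for ch, hit in zip(s, s_hit) if hit]
    t_seq = [ch for ch, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

print(round(jaro("MARTHA", "MARHTA"), 5))
print(round(jaro("GERALD", "SELJAMI"), 5))
```

The similarity lies between 0 (no matching characters) and 1 (identical strings), which makes it directly usable as an entry of the similarity profile.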



3.5 Scoring 



scoring(data, profile, method, label, wtype, ...) estimates matching
scores for the candidate pairs in data frame data from the specified similarity
profile according to the selected method and appends the resulting variable with
the defined label to the data. wtype indicates the score to be computed, e.g. 'LR'
for likelihood ratio (default). The ellipsis indicates additional method-specific argu- 
ments, for example the maximum number of iterations for the EM algorithm. The 
following example illustrates the usage of the function. The output is shown together 
with the output of classification() and decision() in section 3.7.



> scoring(data=d12, profile=31:39, method='EM01',
          wtype='LR')
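
The estimation behind such a scoring call can be sketched as a two-class EM algorithm for binary agreement patterns. The Python function below is a textbook-style illustration under the assumption of conditional independence of the matching variables; the name em_two_class, the starting values, and the clipping constant are hypothetical, and the sketch does not claim to reproduce the variant used in the example above.

```python
def em_two_class(patterns, iterations=50, p=0.1, m_start=0.9, u_start=0.1):
    """Two-class EM for binary agreement patterns.

    patterns: list of 0/1 lists, one list per candidate pair
    Returns (p, m, u): estimated match proportion and the per-variable
    agreement probabilities among matches (m) and non-matches (u).
    """
    eps = 1e-6
    clip = lambda x: min(max(x, eps), 1 - eps)   # keep probabilities in (0, 1)
    n, k = len(patterns), len(patterns[0])
    m, u = [m_start] * k, [u_start] * k
    for _ in range(iterations):
        # E-step: posterior probability that each pair is a match
        post = []
        for gamma in patterns:
            pm, pu = p, 1 - p
            for j, agree in enumerate(gamma):
                pm *= m[j] if agree else 1 - m[j]
                pu *= u[j] if agree else 1 - u[j]
            post.append(pm / (pm + pu))
        # M-step: re-estimate proportion and agreement probabilities
        total = min(max(sum(post), eps), n - eps)
        p = clip(total / n)
        for j in range(k):
            agree_m = sum(g * gamma[j] for g, gamma in zip(post, patterns))
            agree_u = sum((1 - g) * gamma[j] for g, gamma in zip(post, patterns))
            m[j] = clip(agree_m / total)
            u[j] = clip(agree_u / (n - total))
    return p, m, u
```

The likelihood ratio of a pattern is then the product over the variables of m[j]/u[j] for agreements and (1 - m[j])/(1 - u[j]) for disagreements.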



3.6 Classification 

classification(data, scorevar, method, mu, lambda, label, ...)
determines a classification rule for the candidate pairs in data frame data according 
to the selected method (default: empirical Fellegi-Sunter) based on prespecified error 
levels mu and lambda and the matching score in scorevar. The estimated matching 
status is appended to the data as a variable with the defined label. The ellipsis 
indicates additional method-specific arguments, for instance a data frame holding the
training data and the position or label of the true matching status trainstatus. 
The following example illustrates the usage of the function. The result is shown in 
section 3.7. 

> classification(data=d12, scorevar='score.EM01',
                 method='FSemp')
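
How two score thresholds follow from prespecified error levels can be sketched as below. The Python code follows the textbook Fellegi-Sunter construction on a list of similarity patterns with known conditional probabilities; it is an assumption that the framework's empirical method works along these lines, and the function names are hypothetical. The sketch further assumes that the resulting thresholds do not cross.

```python
def fs_thresholds(patterns, mu, lam):
    """Upper and lower likelihood-ratio thresholds.

    patterns: list of (m_prob, u_prob) per similarity pattern, u_prob > 0
    mu:  admissible probability of linking a true non-match
    lam: admissible probability of not linking a true match
    """
    ranked = sorted(patterns, key=lambda p: p[0] / p[1], reverse=True)
    # Declare links from the top while the false link level stays within mu
    upper, cum_u = float("inf"), 0.0
    for m_prob, u_prob in ranked:
        if cum_u + u_prob > mu:
            break
        cum_u += u_prob
        upper = m_prob / u_prob
    # Declare non-links from the bottom while the level stays within lam
    lower, cum_m = float("-inf"), 0.0
    for m_prob, u_prob in reversed(ranked):
        if cum_m + m_prob > lam:
            break
        cum_m += m_prob
        lower = m_prob / u_prob
    return upper, lower

def classify(score, upper, lower):
    """L = link, N = non-link, P = possible link."""
    if score >= upper:
        return "L"
    if score <= lower:
        return "N"
    return "P"

# Four patterns of two binary variables with hypothetical probabilities
patterns = [(0.72, 0.02), (0.18, 0.08), (0.08, 0.18), (0.02, 0.72)]
upper, lower = fs_thresholds(patterns, mu=0.05, lam=0.05)
print([classify(m / u, upper, lower) for m, u in patterns])
```

Tightening mu or lam moves the thresholds apart and enlarges the class of possible links that is left for review.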



3.7 Decision 

decision(data, keys, scorevar, classvar, atype, method, label, ...)
provides an interface to the function assignment() that enables 1:1, 1:n/n:1
and particular m:n assignments of the examined records. Eventually, features supporting
the review of undetermined record pairs and inconsistent values in linked
pairs are intended. decision() comes to a final decision concerning the matching
status of the record pairs in data frame data based on the preliminary classification
in classvar, the matching score scorevar, and the specified method (default:
'greedy'). keys specifies the positions or labels of the key variables referring to
the records from the original data frames. atype specifies the target type of assignment
(default: '1:1'). A variable with the defined label is appended to the data.
The ellipsis indicates additional method-specific arguments not yet determined. The 
following example illustrates the usage of the function. In this case, 60 pairs first 
classified as links as well as all 112 possible links were transferred to the class of 
non-links. 



> decision(data=d12, keys=1:2, scorevar='score.EM01',
           classvar='class.FSemp', atype='1:1', method='greedy')

  fname.d1 fname.d2 score.EM01 class.FSemp 1:1.greedy
1 GERALD   SELJAMI  6.848e-03  L           N
2 PAOLO    SELJAMI  1.709e-04  P           N
3 WEIFENG  SELJAMI  1.709e-05  P           N
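
A greedy 1:1 assignment can be sketched as follows. The Python function is a hypothetical illustration, not the framework's assignment() function: it assumes that only pairs preliminarily classified as links are eligible and that pairs are accepted in order of decreasing score as long as neither record has been assigned yet.

```python
def greedy_one_to_one(pairs):
    """Greedy 1:1 assignment.

    pairs: iterable of (key1, key2, score, status), status in {'L', 'P', 'N'}
    Returns the list of accepted links as (key1, key2) tuples.
    """
    used1, used2 = set(), set()
    links = []
    eligible = [p for p in pairs if p[3] == "L"]
    for key1, key2, _score, _status in sorted(eligible, key=lambda p: p[2],
                                              reverse=True):
        # Accept a pair only if both records are still unassigned
        if key1 not in used1 and key2 not in used2:
            links.append((key1, key2))
            used1.add(key1)
            used2.add(key2)
    return links

pairs = [(1, "a", 9.0, "L"), (1, "b", 5.0, "L"),
         (2, "a", 7.0, "L"), (2, "b", 3.0, "L"), (3, "c", 8.0, "P")]
print(greedy_one_to_one(pairs))
```

All pairs that are not accepted, including the possible links, end up as non-links, which mirrors the behaviour reported for the example above. A greedy pass is fast but not necessarily optimal; solving a linear sum assignment problem (LSAP in Table 1) would maximise the total score instead.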



3.8 Evaluation 

evaluation(data, true, estimated, basis, plot, xaxes, yaxes, ...)
computes the confusion matrix and various quality measures, e.g. false match and
non-match rates, recall, precision, for the given data frame data containing the candidate
record pairs with the estimated and true matching status. basis discerns
whether the confusion matrix and quality measures should be based on the number
of 'pairs' (default) or the number of 'records'. plot is a flag indicating
whether a plot of two quality measures xaxes and yaxes, typically precision and 
recall, should be created (default: FALSE). The ellipsis indicates additional method- 
specific arguments not yet determined. The following example illustrates the usage 
of the function. 

> evaluation(data=d12, true='true',
             estimated='1:1.greedy')
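
The pair-based quality measures can be illustrated with a short Python sketch. The function evaluate and its return format are hypothetical; the formulas are the usual definitions of precision, recall, F-measure, and accuracy derived from the confusion matrix.

```python
def evaluate(true_status, estimated_status):
    """Confusion matrix and quality measures for record pairs.

    Both arguments are sequences of booleans (True = link).
    """
    pairs = list(zip(true_status, estimated_status))
    tp = sum(1 for t, e in pairs if t and e)          # true links found
    fp = sum(1 for t, e in pairs if not t and e)      # false matches
    fn = sum(1 for t, e in pairs if t and not e)      # false non-matches
    tn = sum(1 for t, e in pairs if not t and not e)  # true non-links
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}

true_status = [True, True, True, False, False, False, False, False]
estimated = [True, True, False, True, False, False, False, False]
print(evaluate(true_status, estimated))
```

Because non-links usually dominate the candidate pairs by far, accuracy alone tends to look favourable; precision and recall are more informative for comparing classification rules.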



4 Conclusion and future work 

The SEI framework introduced in this paper represents a considerable step towards statistical
entity identification in R. It consists of seven components according to the stages
of the entity identification process, viz. the preparation of matching variables, the se- 
lection of candidate record pairs, the creation of similarity patterns, the estimation 
of matching scores, the (preliminary) classification of record pairs into links, non- 
links, and possible links, the final decision on the classification and on inconsistent 
values in linked records, and the evaluation of the results. The projected and current 
range of functionality of the framework were presented. Future work consists in the
explicit provision for missing values in the framework as well as the implementation
of additional algorithms for most components. The main focus is on further
scoring and classification algorithms that significantly contribute to the completion 
of the framework which will finally be provided as an R package. 
Abstract. The very rapid adoption of new applications by some segments of the ADSL customers
may have a strong impact on the quality of service delivered to all customers. This
makes the segmentation of ADSL customers according to their network usage a critical step 
both for a better understanding of the market and for the prediction and dimensioning of the 
network. Relying on a “bandwidth only” perspective to characterize network customer behaviour
does not allow the discovery of usage patterns in terms of applications. In this paper,
we shall describe how data mining techniques applied to network measurement data can help 
to extract some qualitative and quantitative knowledge. 



1 Introduction 

Broadband access for home users and small or medium business and especially 
ADSL (Asymmetric Digital Subscriber Line) access is of vital importance for 
telecommunication companies, since it allows them to leverage their copper infras- 
tructure so as to offer new value-added broadband services to their customers. The 
market for broadband access has several strong characteristics: 

• there is a strong competition between the various actors, 

• although the market is now very rapidly increasing, customer retention is impor- 
tant because of high acquisition costs, 

• new applications or services may be picked up very fast by some segments of the 
customers and the behaviour of these applications or services may have a very 
strong impact on the quality of service delivered to all customers (and not only 
those using these new applications or services). 

Two well-known examples of new applications or services with possibly very demanding
requirements in terms of bandwidth are peer-to-peer file exchange systems
and audio or video streaming. 

The above characteristics explain the importance of an accurate understanding 
of the customer behaviour and a better knowledge of the usage of broadband access. 
The notion of “usage” is slowly shifting from a “bandwidth only” perspective to a
much broader perspective which involves the discovery of usage patterns in terms of
applications or services. The knowledge of such patterns is expected to give a much 
better understanding of the market and to help anticipate the adoption of new services 
or applications by some segments and allow the deployment of new resources before 
the new usage effects hit all the customers. 

Usage patterns are most often inferred from polls and interviews which allow an 
in-depth understanding but are difficult to perform routinely, suffer from the small 
size of the sampled population and cannot easily be extended to the whole population
or correlated with measurements (Anderson et al. (2002)). “Bandwidth only”
measurements are performed routinely on a very large scale by telecommunication 
companies (Clement et al. (2002)) but do not allow much insight into the usage pat- 
terns since the volumes generated by different applications can span many orders of 
magnitude. 

In this paper, we report another approach to the discovery of broadband cus- 
tomers’ usage patterns by directly mining network measurement data. After a de- 
scription of the data used in the study and their acquisition process, we explain the 
main steps of the data mining process and we illustrate the ability of our approach to 
give an accurate insight in terms of usage patterns of applications or services while
being highly scalable and deployable. We focus on two aspects of customers’ usage:
usage of types of applications and customers’ daily traffic; these analyses require
observing the data at several levels of detail.



2 Network measurements and data description 

2.1 Probe measurements

The network measurements are performed on ADSL customer traffic by means of 
a proprietary network probe working at the SDH (Synchronous Digital Hierarchy) 
level between the Broadband Access Server (BAS) and the Digital Subscriber Line 
Access Multiplexer (DSLAM). This on-line probe reads and stores all the
relevant fields of the ATM (Asynchronous Transfer Mode) cells and of the IP/TCP
headers. At present, nine probes are deployed in the network; they observe about 18,000 customers
non-stop (a probe can observe about 2,000 customers on a physical link). Once the
probe is in place, data collection is performed automatically. A detailed description 
of the probe architecture can be found in (Francois (2002)). 

2.2 Data description 

For the study reported here, we gathered one month of data, on one site, for about two 
thousand customers. The data give the volumes of data exchanged in the upstream 
and downstream directions of twelve types of applications (web, peer-to-peer, ftp, 
news, mail, db, control, games, streaming, chat, others and unknown) sampled for 
each 6-minute window for each customer. Most of the types of applications correspond
to a group of well-known TCP ports, except the last two which relate to some
well-known but “obscure” ports (others) or dynamic ones (unknown). Since much
of peer-to-peer traffic uses dynamic ports, peer-to-peer applications are recognized 
from a list of application names by scanning the payloads at the application level 
and not by relying on the well-known ports only. This is done transparently for the 
customers; no other use is made of such data than statistical analysis. 
Fig. 1. Volume of the traffic on the applications




Fig. 2. Average hourly volume 



Figure 1 plots the distribution of the total monthly traffic on the applications (all 
days and customers included) for one site in September 2003 (the volumes are given 
in bytes). About 90 percent of the traffic is due to peer-to-peer, web and unknown 
applications and all the monitored sites show a similar distribution. Figure 2 plots 
the average hourly volume for the same month and the same site, irrespective of the 
applications. We can observe that the night traffic remains significant. 
3 Customer segmentation

3.1 Motivation 

The motivation of this study is a better understanding of the customers’ daily traffic 
on the applications. We try to answer the question: who is doing what and when? 

To achieve this task we have developed a specific data mining process based on 
Kohonen maps. They are used to build successive layers of abstraction starting from 
low-level traffic data to achieve an interpretable clustering of the customers.

For one month, we aggregate the data into a set of daily activity profiles given 
by the total hourly volume, for each day and each customer, on each application 
(we confined ourselves to the three most important applications in volume: peer-to-peer,
web and unknown; an extract of the log file is presented in Figure 3). In the
following, “usage” means “daily activity” described by hourly volumes. The daily
activity profiles are recoded in a log scale to be able to compare volumes with various 
orders of magnitude. 
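
The construction of one daily activity profile can be sketched in Python as follows. The sketch assumes ten 6-minute samples per hour and uses log10(1 + volume) as the log-scale recoding; the exact transformation is not specified above, so both the transformation and the function name daily_profile are assumptions.

```python
import math

def daily_profile(samples, per_hour=10):
    """Turn the 6-minute volumes of one day into a log-scaled hourly profile.

    samples: 24 * per_hour volumes in bytes for one customer, one day,
             one application and one traffic direction
    Returns 24 values log10(1 + hourly volume); the offset of 1 keeps
    inactive hours at 0 (an assumption of this sketch).
    """
    if len(samples) != 24 * per_hour:
        raise ValueError("expected one full day of samples")
    hourly = [sum(samples[h * per_hour:(h + 1) * per_hour])
              for h in range(24)]
    return [math.log10(1 + volume) for volume in hourly]

# One active hour (10 windows of 100 bytes) followed by an inactive day
samples = [100] * 10 + [0] * 230
profile = daily_profile(samples)
print(len(profile), round(profile[0], 4), profile[1])
```

On the log scale, a day with a few kilobytes of web traffic and a day with gigabytes of peer-to-peer traffic remain comparable in shape, which is what the subsequent clustering relies on.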

3.2 Data segmentation using self-organizing maps 

We choose to cluster our data with a Self Organizing Map (SOM) which is an excel- 
lent tool for data survey because it has prominent visualization properties. A SOM is 
a set of nodes organized into a 2-dimensional* grid (the map). Each node has fixed
coordinates in the map and adaptive coordinates (the weights) in the input space. 
The input space is spanned by the variables used to describe the observations. Two 
Euclidean distances are defined, one in the original input space and one in the 2-dimensional
space.

The self-organizing process slightly moves the location of the nodes in the data
definition space, i.e. adjusts the weights according to the data distribution. This weight
adjustment is performed while taking into account the neighbouring relation between 
nodes in the map. 

The SOM has the well-known ability that the projection on the map preserves the 
proximities: observations that are close to each other in the original multidimensional 
input space are associated with nodes that are close to each other on the map. 
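
The weight adjustment described above can be sketched as a small on-line SOM in plain Python. This is a simplified illustration, not the SOM Toolbox used for the experiments: it assumes a rectangular grid with a Gaussian neighbourhood (rather than hexagonal neighbourhoods), linearly decaying learning rate and radius, and hypothetical function names and default parameters.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda node: sum((w - v) ** 2
                                             for w, v in zip(weights[node], x)))

def train_som(data, rows, cols, epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # Initialise every node with a randomly chosen observation
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    steps, step = epochs * len(data), 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / steps
            lr = lr0 * (1 - frac)                    # decaying learning rate
            sigma = max(sigma0 * (1 - frac), 0.5)    # shrinking neighbourhood
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Neighbourhood is measured on the map, not in the input space
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights

data = [[0.0, 0.0]] * 20 + [[10.0, 10.0]] * 20
som = train_som(data, rows=3, cols=3, epochs=10, seed=1)
print(best_matching_unit(som, [0.0, 0.0]), best_matching_unit(som, [10.0, 10.0]))
```

Because neighbouring nodes are dragged along with the best matching unit, nearby nodes end up with similar weights, which is the source of the proximity-preserving projection mentioned above.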

After learning has been completed, the map is segmented into clusters, each clus- 
ter being formed of nodes with similar behaviour, with a hierarchical agglomerative 
clustering algorithm. This segmentation simplifies the quantitative analysis of the 
map (Vesanto and Alhoniemi (2000), Lemaire and Clerot (2005)). For a complete 
description of the SOM properties and some applications, see (Kohonen (2001)) and 
(Oja and Kaski (1999)). 
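
The segmentation of a trained map can be sketched with a naive agglomerative clustering of the node weight vectors. The linkage criterion is not specified above; average linkage is assumed here purely for illustration, and the function name agglomerate is hypothetical.

```python
def agglomerate(vectors, n_clusters):
    """Naive average-linkage agglomerative clustering.

    vectors: list of weight vectors (one per map node)
    Returns a list of clusters, each a list of node indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(vectors[a], vectors[b])) ** 0.5

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two closest clusters
        _, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

nodes = [[0, 0], [0, 1], [10, 10], [10, 11], [5, 50]]
print(agglomerate(nodes, 3))
```

Clustering the nodes rather than the raw observations keeps this step cheap, since a map has far fewer nodes than there are daily profiles.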

3.3 An approach in several steps for the segmentation of customers 

We have developed a multi-level exploratory data analysis approach based on SOM. 
Our approach is organized in five steps (see Figure 6): 



* All the SOMs in this article are square maps with hexagonal neighborhoods. 
• In a first step, we analyze each application separately. We cluster the set of all
the daily activity profiles (irrespective of the customers) by application. For example, 
if we are interested in a classification of web down daily traffic, we only select the 
relevant lines in the log file (Figure 3) and we cluster the set of all the daily activity 
profiles for the application. We obtained a map with a limited number of clusters 
(Figure 4): the typical days for the application. We proceed in the same way for all 
the other applications. 

As a result we end up, for each application, with a set of “typical application
days” profiles which allow us to understand how the customers are globally using
their broadband access along the day, for this application. Such “typical application
days” form the basis of all subsequent analysis and interpretations.
Fig. 3. Log file; each application volume (last column) is a curve similar to the one plotted in Figure 2

Fig. 4. Typical Web-down days



• In a second step we gather the results of previous segmentations to form a 
global daily activity profile: for one given day, the initial traffic profile for an appli- 
cation is replaced by a vector with as many dimensions as segments of typical days 
obtained previously for this application. 

The profile is attributed to its cluster; all the components are set to zero except the 
one associated with the represented segment (Figure 5). This component is set to one. 
We do the same for the other applications. The binary profiles are then concatenated 
to form the global daily activity profile (the applications are correlated at this level 
for the day). 

• In a third step, we cluster the set of all these daily activity profiles (irrespective
of the customers). As a result we end up with a limited number of “typical day”
profiles which summarize the daily activity profiles. They show how the three appli- 
cations are simultaneously used in a day. 

• In a fourth step, we turn to individual customers described by their own set 
of daily profiles. Each daily profile of a customer is attributed to its “typical day”
cluster and we characterize this customer by a profile which gives the proportion of
days spent in each “typical day” for the month.

• In a fifth step, we cluster the customers as described by the above activity
profiles and end up with “typical customers”. This last clustering allows customers
to be linked to daily activity on applications.

Fig. 5. Binary profile constitution

The process (Figure 6) exploits the hierarchical structure of the data: a customer 
is defined by his days and a day is defined by its hourly traffic volume on the applications.
At the end of each stage, an interpretation step allows knowledge to be extracted
incrementally from the analysis results. The unique visualization ability of the
self-organizing map model makes the analysis quite natural and easy to interpret.
More details about this kind of approach on another application can be found in
(Clerot and Fessant (2003)).
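
The recoding used in the second and fourth steps can be sketched in Python as follows. The application names, the numbers of segments, and the function names are hypothetical; the sketch only illustrates the binary (one-hot) encoding of the cluster memberships and the proportion profile of a customer.

```python
def one_hot(cluster, n_clusters):
    """Binary vector with a single one at the position of the cluster."""
    vec = [0] * n_clusters
    vec[cluster] = 1
    return vec

def global_day_profile(assignments, sizes):
    """Concatenate the binary profiles of all applications for one day.

    assignments[app]: index of the typical application day of this day
    sizes[app]:       number of typical application days for this application
    """
    profile = []
    for app in sorted(sizes):            # fixed application order
        profile.extend(one_hot(assignments[app], sizes[app]))
    return profile

def customer_profile(day_clusters, n_typical_days):
    """Proportion of a customer's days spent in each typical day."""
    counts = [0] * n_typical_days
    for cluster in day_clusters:
        counts[cluster] += 1
    return [count / len(day_clusters) for count in counts]

sizes = {"p2p-down": 3, "web-down": 2}        # hypothetical segment counts
assignments = {"p2p-down": 2, "web-down": 0}  # clusters of one given day
print(global_day_profile(assignments, sizes))
print(customer_profile([0, 0, 1, 2], n_typical_days=3))
```

The binary coding makes the applications commensurable before the third clustering, and the proportion profile gives every customer a vector of the same length regardless of the number of observed days.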

3.4 Clustering results 

We experiment with the site of Fontenay in September 2003. All the segmentations
are performed with dedicated SOMs (experiments have been done with the SOM
Toolbox package for MATLAB (Vesanto et al. (2000))).

The first step leads to the formation of 9 to 13 clusters of “typical application
days” profiles, depending on the application. Their behaviours can be summarized
into inactive days, days with a mean or high activity on some limited time periods 
(early or late evening, noon for instance), and days with a very high activity on a 
long time segment (working hours, afternoon or night). 

Figure 7 illustrates the result of the first step for one application: it shows the 
mean hourly volume profiles of the 13 clusters revealed after the clustering for the 
web down application (the mean profiles are computed by the mean of all the ob- 
servations that have been classified in the cluster; the hourly volumes are plotted in 
natural statistics). The other applications can be described similarly. 
Fig. 7. Mean daily volumes of clusters for web down application



The second clustering leads to the formation of 14 clusters of “typical days”.
Their behaviours are different in terms of traffic time periods and intensity. The main 
characteristics are a similar activity in up and down traffic directions and a similar 
usage of the peer-to-peer and unknown applications in clusters. The usage of the web 
application can be quite different in intensity. Globally, the time periods of traffic are 
very similar for the three applications in a cluster. Ten percent of the days show a high
daily activity on the three applications; 25 percent of the days are inactive days. If
we project the other applications on the map of days, we can observe some correlations
between applications: days with a high web daily traffic are also days with high 
mail, ftp and streaming activities and the traffic time periods are similar. The chat 
and games applications can be correlated to peer-to-peer in the same way. 

The last clustering leads to the formation of 12 clusters of customers which can 
be characterized by the preponderance of a limited number of typical days. 
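
A customer can thus be described by the share of the month spent in each typical day. The sketch below illustrates this under stated assumptions: every day of a customer has already been assigned to one of the 14 typical days, labels are 0-based, and the function name and toy data are hypothetical.

```python
import numpy as np

def typical_day_shares(day_labels, n_typical_days):
    """Fraction of a customer's days that fall into each typical-day cluster."""
    counts = np.bincount(np.asarray(day_labels), minlength=n_typical_days)
    return counts / counts.sum()

# Toy month of 30 days (illustrative only): 23 days of one typical day,
# the remaining days spread over two other typical days.
day_labels = [11] * 23 + [0] * 4 + [3] * 3
shares = typical_day_shares(day_labels, n_typical_days=14)
dominant_day = int(np.argmax(shares))
```

Vectors of this kind, one per customer, are the sort of input on which a clustering of customers can operate.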

Figure 8 illustrates the characteristic behaviour of one “typical customer” (cluster
6), which groups 5 percent of the customers, very active on all the applications (with a
high activity all along the day, 7 days out of 10, and very few days with no activity).
We plot the mean profile of the cluster, computed as the mean of all the customers
classified in the cluster (top left, in black). We also give the mean profile computed
on all the observations (bottom left, in grey), for comparison.

The profile can be discussed according to its variations against the mean profile
in order to reveal its specific characteristics. The visual inspection of the left part of
Figure 8 shows that the mean customer associated with the cluster is mainly active
on “typical day 12” for 78 percent of the month. The contributions of the other “typical
days” are low and are lower than the global mean. Typical day 12 corresponds to
very active days. The mean profile of “typical day 12” is shown in the top right part
of the figure in black. The day profile is formed by the aggregation of the individual
application clustering results (a line delimits the set of descriptors for each application).
We also give the mean profile computed on all the observations (bottom, in
grey).

Fig. 8. Profile of one cluster of customers (top left) and mean profile (bottom) and profiles of
associated typical days and typical application days (panel title: Typical day 12)

Typical day 12 is characterized by a preponderant typical application day on each 
application (from 70 percent to 90 percent for each). These typical application days 
correspond to high daily activities. 

For example, we plot the mean profile of “typical day 6" for the peer-to-peer 
down application in the same figure (right bottom; in black the hourly profile of 
the typical day for the application and in grey the global average hourly profile; the 
volumes are given in bytes). These days show a very high activity all along the day 
and even at night for the application (12 percent of the days). Figure 8 schematizes 
and synthesizes the complete customer segmentation process. 

Our step-by-step approach aims at striking a practical balance between the faith- 
ful representation of the data and the interpretative power of the resulting clustering. 
The segmentation results can be exploited at several levels according to the level
of detail expected. The customer level gives an overall view of the customer behaviours.
The analysis also allows a detailed insight into the daily cycles of the customers
in the segments. The approach is highly scalable and deployable, and the clustering
technique used allows easy interpretations. All the other segments of customers
can be discussed similarly in terms of daily profiles and hourly profiles on the applications.





Fig. 9. Profile of another cluster of customers (top left) and mean profile (bottom) and profiles 
of associated typical days and typical application days 



We have identified segments of customers with a high or very high activity all
along the day on the three applications (24 percent of the customers), other segments
of customers with very little activity (27 percent of the customers), and segments of
customers with activity on some limited time periods on one or two applications,
for example a segment of customers with an overall low activity mainly restricted to
working hours on web applications. This segment is detailed in Figure 9.

The mean customer associated with cluster 10 (3 percent of the customers) is
mainly active on “typical day 1” for 42 percent of the month. The contributions of
the other “typical days” are close to the global mean. Typical day 1 (4.5 percent of the
days) is characterized by a preponderant typical application day on the web application
only (both in up and down directions); no specific typical day appears for the two
other applications. The characteristic web days are working days with a high daily
web activity on the segment 10h-19h.

Figure 10 depicts the organization of the 12 clusters on the map (each of the 
clusters is identified by a number and a colour). The topological ordering inherent to 
the SOM algorithm is such that clusters with close behaviours lie close on the map 
and it is possible to visualize how the behaviour evolves in a smooth manner from 
one place of the map to another. The map is globally organized along an axis going
from the north east (cluster 12) to the south west (cluster 6), from low activity to high
activity on all the applications, non-stop throughout the day.




Fig. 10. Interpretation of the learned SOM and its 12 clusters of customers (customers map; region labels: heavy users with high traffic on all applications; few activity; 10h-19h activity; P2P activity, afternoon and evening; average activity)



4 Conclusion 

In this paper, we have shown how the mining of network measurement data can re- 
veal the usage patterns of ADSL customers. A specific scheme of exploratory data 
analysis has been presented to shed light on the usage of applications and daily
traffic profiles. Our data-mining approach, based on the analysis and the interpre- 
tation of Kohonen self-organizing maps, allows us to define accurate and easily 
interpretable profiles of the customers. These profiles exhibit very heterogeneous 
behaviours ranging from a large majority of customers with a low usage of the ap- 
plications to a small minority with a very high usage. 

The knowledge gathered about the customers is not only qualitative; we are also 
able to quantify the population associated to each profile, the volumes consumed on 
the applications or the daily cycle. 

Our methodologies are under continuous development in order to improve our
knowledge of customers’ behaviours.
Abstract. In this paper, we analyze the trading behavior of users in an experimental stock
market with a special emphasis on irregularities within the set of regular trading operations. To 
this end the market is represented as a graph of traders that are connected by their transactions. 
Our analysis is executed from two perspectives: On a micro scale view fraudulent transactions 
between traders are introduced and described in terms of the patterns they typically produce 
in the market’s graph representation. On a macro scale, we use a spectral clustering method 
based on the eigensystem of complex Hermitian adjacency matrices to characterize the trad- 
ing behavior of the traders and thus characterize the market. Thereby, we can show the gap 
between the formal definition of the market and the actual behavior within the market where 
deviations from the allowed trading behavior can be made visible. These questions are for 
instance relevant with respect to the forecast efficiency of experimental stock markets since 
manipulations tend to decrease the precision of the market’s results. To demonstrate this we 
show some results of the analysis of a political stock market that was set up for the 2006 state 
parliament elections in Baden-Wuerttemberg, Germany.



1 Introduction 

Stock markets do not only attract the good traders but also the ones who try to ma- 
nipulate the market. The approaches used by malign traders differ with respect to the 
design of the market, but altogether tend to bias its outcome. In this contribution, we 
present basic behavior patterns that are characteristic of irregular trading activities 
and discuss an approach for their detection. We concentrate on patterns that tend to 
appear in prediction markets. In the first section we adopt a micro scale perspec- 
tive, describing the traders’ individual motivation for malicious actions and deriving 
the characteristics of two basic patterns. The second and third section approach a 
market’s transaction records from a broader (macro) view. The market data is ana- 
lyzed by means of a clustering method on the results of a certain type of eigensystem 
analysis, finding a reliable way of discovering the patterns sought. 
2 Irregular trading behavior in a market

There are several incentives to act in a fraudulent way which result in the basic 
patterns price manipulation and circular trading. In this introductory section, we will 
show these basic patterns that constitute the micro scale view of the market activities. 
They can be made visible when the money or share flows in the market are used to 
generate a graph of traders and flows as shown in section 4.2. 

Price manipulations are, for instance, motivated by idealistic reasons: Members
of parties that may or may not take the hurdle of five percent introduced by German
electoral laws have an incentive to set a price slightly above 5% in order to signal that
every vote given for this party counts. This in turn is expected to motivate electors
to vote who have not yet decided whether to vote at all (see Franke et al. (2006)
and Hansen et al. (2004)). On the other hand, opponents may be induced to lower
the prices for a rival party in order to discourage the voters of this party. These
cases are quite easily detectable, since traders without such a bias in their trading
behavior should have an approximately balanced ratio between buy and sell actions
- this includes the number of offers and transactions as well as trading volumes.
Manipulators, on the other hand, have a highly imbalanced ratio, since they become
either a sink (when increasing the price) or a source (when decreasing the price) of
shares of the respective party. Thus, these traders can be found by calculating the
ratios for each trader and each share and setting a cutoff.
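
The ratio-and-cutoff idea can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the normalized imbalance (bought - sold) / (bought + sold), the cutoff of 0.8, the record layout and all names are assumptions made for the example.

```python
from collections import defaultdict

def imbalance_ratios(trades):
    """trades: iterable of (buyer, seller, share, quantity).

    Returns {(trader, share): imbalance in [-1, 1]}:
    +1 means the trader only bought (sink), -1 means the trader only sold (source).
    """
    bought = defaultdict(float)
    sold = defaultdict(float)
    for buyer, seller, share, qty in trades:
        bought[(buyer, share)] += qty
        sold[(seller, share)] += qty
    ratios = {}
    for key in set(bought) | set(sold):
        b, s = bought.get(key, 0.0), sold.get(key, 0.0)
        if b + s > 0:
            ratios[key] = (b - s) / (b + s)
    return ratios

def flag_manipulators(trades, cutoff=0.8):
    """Keep only trader/share pairs whose absolute imbalance reaches the cutoff."""
    return {k: r for k, r in imbalance_ratios(trades).items() if abs(r) >= cutoff}

# Toy records (illustrative only): trader "m" only buys share "P".
trades = [("m", "a", "P", 50), ("a", "b", "P", 60), ("b", "a", "P", 20)]
flagged = flag_manipulators(trades)
```

The same computation can be repeated on the number of offers or on trading volumes instead of quantities.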

The other basic micro pattern, circular trading, is egoistically motivated; its ob- 
jective is to increase the trader’s endowment by transferring money (either in mon- 
etary units or in undervalued shares) from one or several satellite accounts that are 
also controlled by the fraudulent trader to a central account. In its most extreme form, 
the pattern leads to many accounts with a balance close to zero that have only traded 
with one other account in a circular pattern: shares are sold by the satellite to the cen- 
tral account at the lower end of the spread and then bought back at the higher end of 
the spread, resulting in a net flow of money from the satellite to the central account. 
Often this pattern is preluded by a widening of the spread by buying from and selling 
to the traders whose offers form the spread boundary in order to increase the leverage 
of each transaction between the fraudulent accounts. We have seen cases where the 
order book was completely emptied prior to the money transfer. This pattern is only 
present in markets where the cost of opening an account lies below the benefit of 
doing so, i.e. the initial endowment given to the trader. 

While the most extreme form is easily detectable, we need a criterion for the
more subtle forms. In terms of the flows between traders, circular trading implies
that, while a similar number of shares is transferred in each direction, the amounts of money
exchanged differ significantly. In other words, there is a nontrivial net flow of money
from one or several accounts to the central, fraudulent one.
The problem here lies in the definition of net flow. Optimally, it should be calculated
as the deviation from the “true” price at the time of the offer or trade times the number
of shares. Unfortunately, the true price is only known at the close of the market. As
a remedy, the current market price could be used. However, as we have seen, the
market price may be manipulated and thus is quite unreliable, especially during the
periods in which fraud occurs. The other, preferable approach is to use the volume of
the trades, i.e. the number of shares times the price, as a substitute. For subsequent
transactions with equal numbers of shares, the net flow is equivalent to the difference
in the volumes; for other types of transactions, this is at least an approximation that
facilitates the detection of circular trading.
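
The volume-based criterion can be illustrated with a small sketch. It is written under stated assumptions: trades are given as (buyer, seller, quantity, price) records, a pair of traders is a candidate when the shares exchanged in both directions are nearly equal while the net money flow is large, and the thresholds and names are hypothetical.

```python
from collections import defaultdict

def pairwise_flows(trades):
    """Money flows buyer -> seller (quantity * price); shares flow seller -> buyer."""
    money = defaultdict(float)
    shares = defaultdict(float)
    for buyer, seller, qty, price in trades:
        money[(buyer, seller)] += qty * price
        shares[(seller, buyer)] += qty
    return money, shares

def circular_candidates(trades, share_tol=0.1, min_net_money=500.0):
    """Return (a, b, net) where net > 0 means a net money flow from a to b."""
    money, shares = pairwise_flows(trades)
    pairs = {tuple(sorted(k)) for k in money}
    result = []
    for a, b in sorted(pairs):
        s_ab, s_ba = shares.get((a, b), 0.0), shares.get((b, a), 0.0)
        share_imbalance = abs(s_ab - s_ba) / (s_ab + s_ba)
        net = money.get((a, b), 0.0) - money.get((b, a), 0.0)
        if share_imbalance <= share_tol and abs(net) > min_net_money:
            result.append((a, b, net))
    return result

# Toy records (illustrative only): the satellite sells 100 shares at the low end
# of the spread (40) and buys them back at the high end (60).
trades = [
    ("central", "sat", 100, 40.0),   # central buys from satellite
    ("sat", "central", 100, 60.0),   # satellite buys back
    ("x", "y", 10, 50.0),            # ordinary one-way trade
]
candidates = circular_candidates(trades)
```

In the toy data the share flows cancel out, while 2000 monetary units move from the satellite to the central account, which is the signature described above.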



3 Analysis of trading behavior with complex valued Eigensystem 
analysis 

To analyze the market on a macro scale, we use an eigensystem analysis method. 
The method is fully described in Geyer-Schulz and Hoser (2005). In the next two 
sections we will give a short introduction to the technique and the necessary results 
with respect to the following analysis. 

3.1 Spectral analysis of Hermitian adjacency matrices 

The eigensystem analysis described in Geyer-Schulz and Hoser (2005) results in a
full set of eigenvalues (spectrum) $\Lambda$ with $\lambda_1, \lambda_2, \ldots, \lambda_l$ and their corresponding
eigenvectors $X$ with $x_1, x_2, \ldots, x_l$, where the properties of the flow representation guarantee
that the matrix becomes Hermitian and thus the eigenvalues are real while the
components of the eigenvectors can be complex. This eigensystem represents a full
orthonormal system where $\Lambda$ and $X$ can be written in the Fourier sum representation
$\sum_{k=1}^{l} \lambda_k P_k = H$ with $P_k = x_k x_k^*$; $H$ denotes the linear transformation $H = A_C \cdot e^{-i\pi/4}$
with $A_C = A + i \cdot A^T$ and $A$ the real valued adjacency matrix of the graph. The projectors
$P_k$ are computed as the complex outer product of $x_k$ and represent a substructure
of the graph. We identify the relevant projectors by their covered data variance,
which can be calculated from the eigenvalues since the overall data variance
is given as $\sum_{k=1}^{l} \lambda_k^2$. We detect the most central vertex in the graph by the absolute
value $|x_{max,m}|$ of the eigenvector component corresponding to the largest eigenvalue
$|\lambda_{max}|$. This also holds for the most central vertices in each substructure identified by
the projectors $P_k$.
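
A numerical sketch of this decomposition is given below. It assumes the rotation $e^{-i\pi/4}$ as the factor that turns $A + iA^T$ into a Hermitian matrix, and it uses a small made-up flow matrix; function names and data are illustrative only.

```python
import numpy as np

def hermitian_adjacency(A):
    """Map a real flow matrix A to H = (A + i A^T) * exp(-i pi / 4)."""
    A = np.asarray(A, dtype=float)
    return (A + 1j * A.T) * np.exp(-1j * np.pi / 4)

def eigensystem(A):
    """Eigenvalues/eigenvectors of H, sorted by decreasing absolute eigenvalue."""
    H = hermitian_adjacency(A)
    lam, X = np.linalg.eigh(H)          # real eigenvalues, orthonormal eigenvectors
    order = np.argsort(-np.abs(lam))
    return H, lam[order], X[:, order]

# Toy flow matrix (illustrative only): A[i, j] is the flow from vertex i to vertex j.
A = np.array([[0.0, 5.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 0.0, 0.0]])
H, lam, X = eigensystem(A)

# Share of the overall data variance (sum of squared eigenvalues) per eigenvector.
variance_share = lam**2 / np.sum(lam**2)

# Most central vertex: largest absolute component of the leading eigenvector.
most_central = int(np.argmax(np.abs(X[:, 0])))
```

The array `variance_share` corresponds to the covered data variance used to select the relevant projectors.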

3.2 Clustering within the eigensystem 

Given the eigensystem as introduced in the last section, we take the set of positive
eigenvalues $\Lambda^+$ with $\lambda_1^+, \lambda_2^+, \ldots, \lambda_t^+$ and their corresponding eigenvectors $X^+$ with
$x_1^+, x_2^+, \ldots, x_t^+$ and build the matrix $R_{n \times t} = (\lambda_1^+ x_1^+ \,|\, \lambda_2^+ x_2^+ \,|\, \ldots \,|\, \lambda_t^+ x_t^+)$. With this matrix
and its complex conjugate we build the matrix $S_{n \times n} = R R^*$ as the scalar product
matrix. Since we work in Hilbert space, distances are defined by the following scalar
products: $\|x - y\|^2 = \langle x - y \,|\, x - y \rangle = \|x\|^2 + \|y\|^2 - 2\,\mathrm{Re}(\langle x \,|\, y \rangle)$. Distances become
minimal if the real part of the scalar product becomes maximal. Within this matrix
$S$ we find the clusters $p_k$ by assigning the vertices of the network to the clusters such
that a vertex $i$ belongs to a cluster $p_k$ if $\mathrm{Re}(S_{i,p_k}) = \max_j \mathrm{Re}(S_{i,j})$. As at least one of
the eigenvalues of $\Lambda$ has to be negative due to $\sum_{k=1}^{l} \lambda_k = 0$, the minimum number
of clusters is at least one, at most $l - 1$ for the analyzed network. For details of this
approach see Hoser and Schröder (2007).
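
The assignment rule can be illustrated on a toy network. The sketch assumes the Hermitian matrix $H = (A + iA^T) e^{-i\pi/4}$ and a made-up flow matrix consisting of two disconnected stars; the function name, tolerance and data are hypothetical.

```python
import numpy as np

def spectral_clusters(A, tol=1e-9):
    """Assign each vertex i to the vertex j that maximizes Re(S[i, j])."""
    A = np.asarray(A, dtype=float)
    H = (A + 1j * A.T) * np.exp(-1j * np.pi / 4)
    lam, X = np.linalg.eigh(H)
    pos = lam > tol                      # keep the positive eigenvalues only
    R = X[:, pos] * lam[pos]             # n x t matrix with columns lambda_k^+ x_k^+
    S = R @ R.conj().T                   # n x n scalar product matrix
    labels = np.argmax(S.real, axis=1)   # cluster label = index of the attracting vertex
    return labels, S

# Toy network (illustrative only): vertex 0 trades with 1 and 2,
# vertex 3 trades with 4 and 5; A[i, j] is the flow from i to j.
A = np.zeros((6, 6))
A[0, 1], A[1, 0] = 10.0, 9.0
A[0, 2], A[2, 0] = 6.0, 7.0
A[3, 4], A[4, 3] = 8.0, 8.0
A[3, 5], A[5, 3] = 5.0, 4.0
labels, S = spectral_clusters(A)
```

In this example each leaf is assigned to the centre of its star, so two clusters are found, one per positive eigenvalue.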



4 Analysis of the dataset 

When analyzing an actual market to discover fraudulent traders, the basic patterns 
introduced in section 2 reflect these traders’ behavior (or part of it) within the market. 
To describe the actions taken by the traders we use the eigensystem analysis together 
with the spectral clustering method described in section 3. In order to demonstrate 
the use of this powerful method, we transform the transaction data of the market into 
a network as detailed in section 4.2. Eigensystem analysis is advantageous for the 
analysis here as it takes into account not only the relations from one node to the next, 
but computes the status of one node recursively from the information on the status of 
all other nodes within the network and is therefore referred to as a centrality measure 
(for the idea see Brin and Page (1998)). 

4.1 Description of the dataset 

We analyze a dataset generated by the political stock market system PSM used for
the prediction of the 2006 state elections in Baden-Wuerttemberg, Germany. The
traders were mainly readers of rather serious and politically balanced newspapers all
over the election region. The market ran from January 31st, 2006 until election day
on March 26th, 2006 for about twelve weeks and was stopped with the closing time
of the polling stations at 18:00 CET, when the first official information on the voters’
decision is allowed to be released. More detailed data on the market is given in Table 1.



Table 1. Statistical Data on the 2006 state parliament elections in Baden-Wuerttemberg in
Germany

Number of traders (at least one sell or buy transaction)    306 traders
Number of traders (at least one sell transaction)           190 traders
Number of traders (at least one buy transaction)            291 traders
Number of transactions                                      10786 transactions
Number of shares                                            7 shares
Avg. volume per trade                                       214.6 shares
Avg. money flow per trade                                   2462.1 monetary units
Money flow in total                                         26556378 monetary units
Share flow in total                                         2314197 shares



Traders in the market are given 100,000 monetary units (MU) as initial endowment.
The market itself ran a continuous double auction market mechanism where
offers by traders are executed immediately if they match. For each share an order
book is provided by the system where buy and sell offers are added and subsequently
removed in the case of matching or withdrawal.

Fig. 1. Eigenvectors of the traders within the most prominent clusters

4.2 Generating the network 

In markets with a central market instance the traders usually communicate only with
this central instance; trades are executed against “the market”, and a direct communication
between traders does not take place. This results in an anonymous two-mode
network perspective where a trader has no information on his counterparts, neither in the
offers nor in the transactions. As the idea of the fraudulent action in
the circular trading pattern from section 2 essentially relies on the knowledge of
the counterpart trader, we build the trader-to-trader network where the set of nodes
consists of the traders that appeared in the transaction records. These traders have
issued at least one offer that was matched and executed by the market mechanism.
The edges of the network are weighted by the monetary flow between each pair of traders
in the network (price times number of shares).
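
The construction of the network can be sketched as follows, assuming the transaction records are available as (buyer, seller, price, quantity) tuples and that money flows from the buyer to the seller; the record layout, the names and the toy data are assumptions made for this illustration.

```python
import numpy as np

def build_flow_matrix(transactions):
    """Build the trader-to-trader matrix of monetary flows.

    transactions: list of (buyer, seller, price, quantity).
    Returns (traders, A) where A[i, j] is the total money paid by traders[i] to traders[j].
    """
    traders = sorted({t for buyer, seller, _, _ in transactions for t in (buyer, seller)})
    index = {t: i for i, t in enumerate(traders)}
    A = np.zeros((len(traders), len(traders)))
    for buyer, seller, price, qty in transactions:
        A[index[buyer], index[seller]] += price * qty   # edge weight: price times shares
    return traders, A

# Toy transaction records (illustrative only).
transactions = [
    ("t1", "t3", 50.0, 10),
    ("t3", "t1", 48.0, 10),
    ("t1", "t2", 20.0, 5),
]
traders, A = build_flow_matrix(transactions)
```

The resulting real valued matrix `A` is the kind of adjacency matrix on which the eigensystem analysis of section 3 operates.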

4.3 Results of the analysis 

Within an agile market with random and normally distributed matching between the
acting traders, good traders should appear in this analysis with a relatively balanced 
money flow as argued in section 2. Acting in fraudulent patterns, on the other hand, 
leads to a bias of these flows regarding the fraudulent trader, his counterparts and 
their connected traders. 

Applying eigensystem analysis to the complex valued adjacency matrix as defined
in section 3 reveals the patterns of trading behavior within the data set. The
spectrum of the market shows symmetry since the largest and smallest eigenvalues
have the same absolute value, but different signs. A symmetric spectrum points towards
a star-structured graph. The variances given in Figure 1 reveal that the first
pattern (first two eigenvalues) already describes about 62% of the data variance. To
reach more than 80% of the data variance it is sufficient to look at the first 14 eigenvectors;
these are shown in Figure 1. On the top and bottom of the figure the IDs of
the traders are given. On the right hand side the sign of the corresponding eigenvalue
is depicted, since, as explained in section 3.1, positive and negative eigenvalues exist.
On the left hand side the covered data variance for each eigenvector is given. The
eigenvectors are represented as rows from top to bottom, with the eigenvectors corresponding
to the highest absolute eigenvalues in the top rows, and those corresponding
to lower absolute eigenvalues listed consecutively. Normally, each eigenvector component
is represented as a colored square. The color saturation reflects the absolute
value of this component, while the color itself reflects the phase of the complex
valued eigenvector component. In the black and white graphic in Figure 1 both values
had to be combined in the shade of grey.

As can clearly be seen, there are four blocks $c_1$ to $c_4$ in this figure. The block
$c_1$ in the upper left hand corner shows that traders with IDs 1847 and 1969 had an
almost balanced trading communication between them, and the volume was large.
The second block $c_2$ in the middle of the figure represents the trading behavior of the
group of traders with IDs 1922, 1775, 1898 and 1858. Here it can be stated that the
connection between 1922, 1898 and 1858 is quite strong, and the trading behavior
was nearly balanced between 1922 and 1858, while the behavior between 1922 and
1898 has a stronger outbound direction from 1922 to 1858. Between the first and
second block the eigenvectors 3 and 4 describe normal trading behavior as defined
by the market. The third block $c_3$ shows the traders with IDs 1924 and 1948. These
again show a nearly balanced behavior, as do the traders with IDs 1816 and 1826 in
the lower right hand corner of the figure.

These results were compared to the trading data in the data base. The result is 
given in Figure 2. The setup is similar to Figure 1 and it can easily be verified that 
the trading behavior is consistent with the results from the eigensystem analysis. 
Whenever the eigensystem analysis revealed a nearly balanced trading behavior this 
holds true even if the absolute values of transactions are different, since the order of 
magnitude stays approximately the same. The important aspect lies in the difference 
between the values, as it shows the transfer of money from one trader to the other. 

It can thus be seen that the eigensystem reveals overall information about the
trading behavior on the market, when transformed into a trader-to-trader network.
At the same time, the trading behavior of each trader towards
other traders can be analyzed. Since the method used is an eigenanalysis,
the absolute value of each eigenvector component is similar to the eigenvector
centrality used e.g. by Google (Brin and Page (1998)) to define relevant actors in a
graph. Our approach, though, allows a decomposition of the market into the distinguishable
trading patterns or subgroups of traders.

To visualize and illustrate the results of the eigensystem analysis as a graph, we
have taken the respective subgraph which shows the relevant actors as found by the
eigensystem analysis, embedded into the network of all their trading counterparts in
Figure 3. As can be seen, the relevant actors really have many connections within
the market and even amongst each other, which again validates the results of the
eigensystem analysis.

Fig. 2. Reduced adjacency matrix entries for the traders within the most prominent clusters
among themselves

Fig. 3. Unweighted subgraph of the traders within the most prominent clusters to all related
traders



5 Conclusion 

As manipulation within electronic trading systems is limited to behavioral aspects 
and the usual amount of data is quite high, irregular acting is likely to remain hidden 
within the mass of data when using naive fraud analysis techniques. Structural
effects of networks also blur a clear view. We found that a recursive network analysis
approach, facilitated by a trader-to-trader network, supports the discovery of irregular
patterns. In particular, the chosen network makes it possible to track those traders who
try to use the network in their own favor and thus break the anonymity assumed by
the market system.

Further research will focus on the analysis of the mix of several patterns, the
detection of plain patterns in very noisy trading data as well as the weight functions
for the edges within the network transaction graph. On the side of the analysis
technique, a comparison of traditional stock market measurements and the measures
that arise from the approach of analyzing the behavioral aspects in electronic trading
systems in a network analysis context is of special interest.
Abstract. A Balanced Scorecard is more than a business model because it moves performance
measurement to performance management. It consists of performance indicators which
are inter-related. Some relations are hard to find, like soft skills. We propose a procedure to 
fully specify these relations. Three types of relationships are considered. For the function types 
inverse functions exist. Each equation can be solved uniquely for variables at the right hand 
side. By generating noisy data in a Monte Carlo simulation, we can specify function type and 
estimate the related parameters. An example illustrates our procedure and the corresponding 
results. 



1 Related work 

Indicator systems are appropriate instruments to define business targets and to mea- 
sure management indicators together. Such a system should not be just a system of 
hard indicators; it should be used as a system with control in which one can bring 
hard indicators and management visions together. 

In the beginning of the 90’s Johnson and Kaplan (1987) published the idea how
to bring a company’s strategy and used indicators together. This system, also known
as Balanced Scorecards (BSC), has been developed further ever since.

The relationships between those indicators are hard to find. According to Marr
(2004), companies understand their business better if they visualise relations between
available indicators. However, some indicators influence each other in cause and
effect relations, which increases the validity of these indicators. Nevertheless, according
to studies by Ittner et al (2003) and Marr (2004), 46% of the questioned companies do
not, or are not able to, visualise cause-and-effect relations of indicators.

Several approaches try to solve the existing shortcomings. 

A possible way to model fuzzy relations in a BSC is described in Nissen (2006). 
Nevertheless, this leads to restrictions in the variable domains. 
Blumenberg et al (2006) concentrate on Bayesian Belief Networks (BBN) and
try to predict value chain figures and enhanced corporate learning. The weakness of 
this prediction method is that it does not contain any loops which BSCs may contain. 
Loops within BSCs must be removed if BBN are used to predict causes and effects 
in BSCs. 

Banker et al (2004) suggest calculating trade-offs between indicators. The weak- 
ness of this solution is that they concentrate on one financial and three nonfinancial 
performance indicators and try to derive management decisions. 

A totally different way of predicting relations in BSCs is the usage of system
dynamics. System dynamics is usually used to simulate complex dynamic systems
(Forrester (1961)). Various publications exist on how to combine these indicators
with dynamic systems to predict economic scenarios in a company, e.g. Akkermans
et al (2002). In contrast to these approaches we concentrate on existing performance
indicators and try to predict relationships between these indicators instead of predicting
economic scenarios. It is similar to the methods of system identification. In
contrast, our approach calculates in a more flexible way all models within the described
model classes (see section 3).



2 Balanced scorecards 

”If you can’t measure it, you can’t manage it” (Kaplan and Norton (1996), p. 21). 
With this sentence the BSC inventors Kaplan and Norton made a statement which 
describes a common problem in the industry: you can not manage a company if 
you don’t have performance indicators to manage and control your company. Kaplan 
and Norton presented the BSC - a management tool for bringing the current state 
of the business and the strategy of the company together. It is a result of previous 
indicator systems. Nevertheless, a BSC is more than a business system (Friedag & 
Schmidt 2004). Kaplan & Norton (2004) emphasise this in their further development 
of Strategy Maps. 

However, what are these performance indicators and how can you measure them?
Preißner (2002) divides the functionality of indicators into four topics: operationalisation
(”indicators should be able to reach your goal”), animation (”a frequent measurement
gives you the possibility to recognise important changes”), demand (”it can
be used as control input”) and control (”it can be used to control the actual value”).
Nonetheless, we understand an indicator as defined in (Lachnit 1979).

But before a decision is made which indicator is added to the BSC and the corre- 
sponding perspective the importance of the indicator has to be evaluated. Kaplan & 
Norton divide indicators additionally into hard and soft, short and long-term objec- 
tives. They also consider cause and effect relations. The three main aspects are: 1. All 
indicators that do not make sense are not worthwhile being included into a BSC; 2. 
While building a BSC, a company should differentiate between performance and re- 
sult indicators; 3. All non-monetary values should influence monetary values. Based 
on these indicators we are now able to build up a complete system of indicators which 




A Procedure to Estimate Relations in a Balanced Scorecard 365 



turns into or influences each other and seeks a measurement for one of the follow- 
ing four perspectives: (1) Financial Perspective to reflect the financial performance 
like the return on investment; (2) Customer Perspective to summarize all indicators 
of the customer/company relationships; (3) Business Process Perspective to give an 
overview about key business processes; (4) Learning and Growth Perspective which 
measures the company’s learning curve. 




Fig. 1. BSC Example of a domestic airline 



By splitting a company into four different views the management of a company 
gets the chance of a quick overview. The management can focus on its strategic goal 
and is able to react in time. They are able to connect qualitative performance indi- 
cators with one or all business indicators. Moreover the construction of an adequate 
equation system might be impossible. 

Nevertheless the relations between indicators should be elaborated and an approx- 
imation of the relations of these indicators should be considered. In this case mul- 
tivariate density estimation is an appropriate tool for modeling the relations of the 
business. Figure 1 shows a simple BSC of an airline company. Profitability is the 
main figure of interest but additionally seven more variables are useful for managing
the company. Each arc visualizes the cause and effect relations. This example is
taken from "The Balanced Scorecard Institute" (www.balancedscorecard.org).





366 Veit Koppen et al. 

3 Model 



To quantify the relationships in a given data set different methods for parameter esti- 
mation are used. Measurement errors within the data set are allowed, but these errors 
are assumed to have a mean value of zero. For each indicator within the data set no 
missing data is assumed. To quantify the relationships correctly it is further assumed 
that intermediate results are included in the data set. Otherwise the relationships will 
not be covered. Heteroscedasticity as well as autocorrelations of the data are not
considered.

3.1 Relationships, estimations and algorithm 

In our procedure three different types of relationships are investigated. The first two
function types are unknown because the operators linking the variables are unknown:

$$z = f(x, y) = x \otimes y \tag{1}$$

where $\otimes$ represents an addition or a multiplication operator. The third type is a
parametric real-valued function:

$$y = f_\theta(x) = \begin{cases} p & x < a \\ \dfrac{c}{1 + \exp(-d + g\,x)} + h & a \le x \le b \\ q & x > b \end{cases} \tag{2}$$

with $\theta = (a, b, c, d, g, h)$, $p = \frac{c}{1 + \exp(-d + g\,a)} + h$ and
$q = \frac{c}{1 + \exp(-d + g\,b)} + h$. Note that all three
function types are assumed to be separable, i.e. uniquely solvable for $x$ or $y$ in (1)
and for $x$ in (2). Thus forward and backward calculations in the system of indicators are
possible. As a data set is tested independently with respect to the described function
types, a Šidák correction has to be applied (cf. Abdi (2007)).

Additive relationships between three indicators ($Y = X_1 + X_2$) are detected via
multiple regression. The model is:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u \tag{3}$$

where $u \sim N(0, \sigma^2)$. The relationship is accepted if the level of significance of
all explanatory variables is high and $\beta_0 = 0$, $\beta_1 = 1$ and $\beta_2 = 1$. The
multiplicative relationship $Y = X_1 \cdot X_2$ is detected by the regression model:

$$Y = \beta_0 + \beta_1 Z + u \quad \text{with } Z = X_1 \cdot X_2,\; u \sim N(0, \sigma^2). \tag{4}$$

The relationship is accepted if the level of significance of the explanatory variable
is high and $\beta_0 = 0$ and $\beta_1 = 1$. The nonlinear relationship between two
indicators according to equation 2 is detected by parameter estimation based on nonlinear
regression:

$$Y = \frac{c}{1 + \exp(-d + g\,X)} + h + u \quad \forall\, a \le X \le b;\; u \sim N(0, \sigma^2). \tag{5}$$
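The acceptance rules for models (3) and (4) can be sketched as follows. This is a minimal illustration in Python rather than the R calls used in the study; the helper names `ols` and `accepted`, the tolerances, and the simulated data are assumptions made for the example, and the significance test is left out.

```python
import random

def ols(columns, y):
    """Least-squares fit of y on an intercept plus the given columns.

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    """
    rows = [[1.0] + [col[t] for col in columns] for t in range(len(y))]
    p = len(rows[0])
    # Augmented normal-equation matrix [X'X | X'y].
    m = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
         + [sum(r[a] * yt for r, yt in zip(rows, y))] for a in range(p)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(p):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[a][p] / m[a][a] for a in range(p)]

def accepted(estimates, targets, tolerances):
    """True if every estimate lies within its boundary around the target."""
    return all(abs(e - t) <= tol
               for e, t, tol in zip(estimates, targets, tolerances))

# Simulated indicators (hypothetical values, not the data of the case study).
rng = random.Random(1)
x1 = [rng.gauss(100, 10) for _ in range(1000)]
x2 = [rng.gauss(40, 2) for _ in range(1000)]
y_add = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
y_mul = [a * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

beta_add = ols([x1, x2], y_add)                            # model (3)
beta_mul = ols([[a * b for a, b in zip(x1, x2)]], y_mul)   # model (4)

additive_found = accepted(beta_add, [0.0, 1.0, 1.0], [5.0, 0.1, 0.1])
multiplicative_found = accepted(beta_mul, [0.0, 1.0], [5.0, 0.1])
```

The intercept boundary is chosen wider than the slope boundaries because the intercept is estimated far less precisely when the indicator means are large; a production version would use the standard errors reported by the regression routine instead of fixed tolerances.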
In a first step the indicators are extracted from a business database, files or
tools like Excel spreadsheets. The number of extracted indicators is denoted by $n$.
In the second step all possible relationships have to be evaluated. For the multiple
regression scenario $\binom{n}{3}$ cases are relevant. Testing multiplicative relationships
demands $\frac{n!}{2\,(n-3)!}$ regressions. The nonlinear regression needs to be performed
$n(n-1)$ times. All regressions are performed in R. The univariate and the multivariate
linear regression are performed with the lm function from the R-base stats package. The
nonlinear regression is fitted by the nls function in the stats package and the level of
significance is evaluated. If additionally the estimated parameter values are within given
boundaries the relationship is accepted.
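The number of candidate regressions per relationship type can be checked by brute-force enumeration. The sketch below is written in Python, not in the R code used for the study, and the function name `candidate_counts` is hypothetical. The closed-form expressions are an assumption derived from the enumeration; for $n = 16$ they reproduce the totals 560, 1680 and 240 listed in Tab. 2.

```python
from itertools import combinations
from math import comb, factorial

def candidate_counts(n):
    """Count the regressions per relationship type for n indicators."""
    # Additive: one fit per unordered triple of indicators.
    additive = sum(1 for _ in combinations(range(n), 3))
    # Multiplicative: each indicator as response, each unordered pair
    # of the remaining indicators as the two factors.
    multiplicative = sum(
        1
        for i in range(n)
        for j, k in combinations(range(n), 2)
        if i != j and i != k
    )
    # Nonlinear: every ordered pair of distinct indicators.
    nonlinear = sum(1 for i in range(n) for j in range(n) if i != j)
    return additive, multiplicative, nonlinear

n = 16
counts = candidate_counts(n)
closed_forms = (comb(n, 3), factorial(n) // (2 * factorial(n - 3)), n * (n - 1))
print(counts)  # (560, 1680, 240)
```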

The pseudo code of the complete environment is given in Algorithm 1.



Algorithm 1 Estimation Procedure

Require: data matrix data[t x n] with t observations for n indicators,
         significance level, boundaries for parameters
Ensure: detected relationships between indicators
for i = 1 to n - 2 AND j = i + 1 to n - 1 AND k = j + 1 to n do
    estimation by lm(data[,i] ~ data[,j] + data[,k])
    if significant AND parameter estimates within boundaries then
        Relationship "Addition" found
    end if
end for
for i = 1 to n AND j = 1 to n - 1 AND k = j + 1 to n do
    if i != j AND i != k then
        set Z := data[,j] * data[,k]
        estimation by lm(data[,i] ~ Z)
        if significant AND parameter estimates within boundaries then
            Relationship "Multiplication" found
        end if
    end if
end for
for i = 1 to n AND j = 1 to n do
    if i != j then
        estimation by nls(data[,j] ~ c/(1+exp(-d+g*data[,i])) + h)
        if significant then
            "Nonlinear Relationship" found
        end if
    end if
end for



4 Case study 

For our case study we create an artificial model with 16 indicators and 12 relation- 
ships, see Fig. 2. It includes typical cases of the real world. 
Indicators 1-4 are independently and randomly distributed. In Fig. 2 they are
displayed in grey and represent the basic input for the simulated BSC system. All other
indicators are either functionally dependent on two indicators related by an addition or
multiplication or functionally dependent on an indicator according to equation 2. Some
of these indicators affect other quantities or represent leaf nodes in the BSC model
graph, cf. Fig. 2. Because indicators may not be precisely measured,
we add noise to some indicators, see Tab. 1. Note that IndicatorPlus4 has a skewed
added noise whereas the remaining added noise is symmetrical.

In our case study we hide all given relationships and try to identify them, cf.
section 3.



Table 1. Indicator Distributions and Noise

Indicator    Distribution    Indicator            Added noise    Indicator       Noise
Indicator1   N(100, 10^2)    IndicatorPlus1       N(0, 1)        IndicatorExp1   N(0, 1)
Indicator2   N(40, 2^2)      IndicatorPlus4       E(1) - 1       IndicatorExp4   U(-1, 1)
Indicator3   U(-10, 10)      IndicatorMultiply1   N(0, 1)
Indicator4   E(2)            IndicatorMultiply4   U(-1, 1)



5 Results 

The case study runs in three different stages: with 1k, 10k, and 100k randomly
distributed data. The results are similar and can be classified into four cases: (1) a
relation exists and it was found (displayed black in Fig. 3), (2) a relation was found
but does not exist (displayed with a pattern in Fig. 3) (error of the second kind), (3)
no relation was found but one exists in the model (displayed white in Fig. 3) (error
of the first kind), and (4) no relation exists and none was found. Additionally the
results have been split according to the operator class (see Tab. 2).



Table 2. Identification Results

Observations         1k                   10k                  100k
Operator         +      *    Exp      +      *    Exp      +      *    Exp
(2)              0      3     27      0      5     48      0      2     49
(3)              1      0      3      1      0      3      1      0      3
Total          560   1680    240    560   1680    240    560   1680    240



Hence, Tab. 2 shows that the results for all experiments are similar for the
operators addition and multiplication. For non-linear regression, relationships could not
be discovered properly.

The additive relation of IndicatorPlus4 was the only additive or multiplicative relation
that was not detected, see case (3) in Tab. 2. This is caused by the fact that the
indicator has an added noise which is skewed. In such a case the identification is not
possible.



Fig. 3. Results of the Artificial Example for 100k observations
6 Conclusion and outlook



Traditional regression analysis allows estimating the cause and effect dependencies
within a profit seeking organization. Univariate and multivariate linear regression
exhibit the best results whereas skewed noise in the variables destroys the possibility
to detect these relationships.

Non-linear regression has a high error output due to the fact that optimization
has to be applied and starting values are not always at hand. The results from the
non-linear regression should therefore be interpreted with care.

In future work we will try to improve our results by removing indicators for which
we find a nearly certain relationship. Additionally we plan to work on real
data which also includes the possibility of missing data for indicators. The research aims
at creating a company's BSC with relevant business figures while looking only at a
company's indicator system.
Abstract. The manual customisation of reference models to suit special purposes is an
exhausting task that has to be accomplished thoroughly to preserve, make explicit and
extend the inherent intention. This can be facilitated by the usage of automatisms like
those provided by the Configurative Reference Modelling approach. Thus, the reference
model has to be enriched by data describing for which scenario a certain element is
relevant. By assigning this data to application contexts, it builds a taxonomy. This paper
aims to illustrate the advantage of the usage of this taxonomy during three relevant
phases of Configurative Reference Modelling: Project Aim Definition, Construction and
Configuration of the configurable reference model.



1 Introduction 

Reference information models - in this context solely called reference models - give 
recommendations for the structuring of information systems as best or common prac- 
tices and can be used as a starting basis for the development of application specific 
information system models. The better the reference models are matched with the 
special features of individual application contexts, the bigger the benefit of reference 
model use. Configurable reference models contain rules that describe how different 
application specific variants are derived. Each of these rules is placed together with 
a condition and an implication. Each condition describes one application context of 
the reference model. The respective implication determines the relevant model variant.
For describing the application contexts configuration parameters are used. Their
specification forms a taxonomy. Based upon a procedure model this paper highlights
the usefulness of taxonomies in the context of Configurative Reference Modelling.
Thus, the paper is structured as follows: First, the Configurative Reference Modelling
approach and its procedure model are described. Afterwards, the usefulness of
the application of taxonomies is shown for the respective phases. An outlook
on future research areas concludes the paper.
2 Configurative Reference Modelling and the application of
taxonomies 

2.1 Configurative Reference Modelling 

Reference models are representations of knowledge recorded by domain experts to 
be used as guidelines for every day business as well as for further research. Their 
purpose is to structure and store knowledge and give recommendations like best or 
common practices. They should be of general validity in terms of being applicable for 
more than one user (see Schuette (1998); vom Brocke (2003); Fettke, Loos (2004)). 
Currently 38 of them have been clustered and categorised, spanning domains like 
logistics, supply chain management, production planning and control or retail (see
Braun, Esswein (2006)). 

General applicability is a necessary requirement for a model to be characterised 
as reference model, as it has to grant the possibility to be adopted by more than one 
user or company. Thus, the reference model has to include information about dif- 
ferent business models, different functional areas or different purposes for its usage. 
A reference model for retail companies might have to cover economic levels like 
Retail or Wholesale, trading levels like Inland trade or Foreign trade as well as
functional areas like Sales, Production Planning and Control or Human Resource
Management. While this constitutes the general applicability for a certain domain, one
special company usually needs just one suitable instance of this reference model, for
example Retail/Inland Trade, leaving the remaining information dispensable. This
yields the problem that the perceived demand of information for each individual will
be hardly met. The information delivered - in terms of models of different types
which might consist of different element types and hold different element instances
- might either be too little or too extensive, hence the addressee will be insufficiently
supplied with information on the one hand or overburdened on the other hand.
Consequently, a person requiring the model for the purpose of developing the database
of a company might not want to be burdened with models of the technique
Event-driven Process Chain (EPC), whose purpose is to describe processes, but with Entity
Relationship Model (ERM), used to describe data structures. To compensate for this in
a conventional manner, a complex manual customisation of the reference model is
necessary to meet the addressee's demand. Another implication is the maintenance
of the reference model. Every time changes are committed to the reference model,
every instance has to be manually updated as well.

This is where Configurable Reference Models come into operation. The basic 
idea is to attach parameters to elements of the integrated reference model in ad- 
vance, defining the contexts to which these elements are relevant (see e. g. Knack- 
stedt (2006)). In reference to the example given above this means that certain ele- 
ments of the model might just be relevant for one of the economic levels - retail or 
wholesale -, or for both of them. The user eventually selects the best suited parame- 
ters for his purpose and the respective configured model is generated automatically. 
This leads to the conclusion that the lifecycle of a configurable reference model can 
be divided into two parts called Development and Usage (see Schlagheck (2000)). 
The first part - relevant for the reference model developer - consists of the phases
Project Aim Definition, Model Technique Definition, Model Construction and
Evaluation, whereas the second one - relevant for the user - includes the
phases Project Aim Definition, Search and Selection of existing and suitable
reference models and Model Configuration. The configured model can be further adapted
to satisfy individual needs (see Becker et al. (2004)). Several phases can be identified
where the application of taxonomies can be of value, especially Project Aim
Definition and Model Construction (for the developer) and Model Configuration (for the
user). Fig. 1 gives an overview of the phases, where the ones that will be discussed
in detail are solid and the ones not relevant here are greyed out. The output of both
Development and Usage is printed in italics.
Fig. 1. Development and Usage of Configurable Reference Models



2.2 Project aim definition 

During the first phase, Project Aim Definition, the developers have to agree on the
purpose of the reference model to build. They have to decide for which domain the
model should be used, which business models should be supported, which
functional areas should be integrated to support the distribution for different perspectives
and so on. To structure these parameters, a morphological box has proven
to be applicable. First, all instances for each possible characteristic have to be
listed. By shading the relevant parameters for the reference model, the developers
commit themselves to one common project aim and reduce the given complexity.
Thus, the emerging morphological box constitutes a taxonomy, implying the
variants included in the integrated configurative reference model (see Fig. 2; Mertens,
Lohmann (2000)). By generating this taxonomy, the developers become aware of all
possible included variants, thus getting a better overview of the to-be state of the
model. One special variant of the model will later on be generated by the user choosing
one or a set of the parameters. The choice of parameters should be
supported by an underlying ontology that can be used throughout both Development
and Usage (see Knackstedt et al. (2006)). The developers have to decide whether
or not dependencies between parameters exist. In some cases, the choice of one
Characteristic               Characteristic form
Business level               Retailer | Wholesaler
...                          Retail business | Third-party business | Pooled payment business | Promotion business | Service business
Trading level                Inland trade | Foreign trade
Horizontal cooperation       Retailers | Wholesalers | Other cooperation
Vertical cooperation         Retail and wholesale | Wholesale and industrial companies | Retail, wholesale and industrial companies
Contact orientation          Stationary | Itinerant | Mail order
Sales contact form           Self-service | Catalog | Vending machine
Beneficiary                  Investment goods trade | Consumer goods trade
Range extent                 Wide and deep range | Wide and shallow range | Narrow and deep range | Narrow and shallow range
Pricing policy               Active | Passive
Purchase initiation through  ...
Logistics handling           By the customer (collect) | ...

Fig. 2. Example of a morphological box, used as taxonomy (Becker et al. (2001))



specific parameter within one specific characteristic determines the necessity of an- 
other parameter within another characteristic. For example, the developers might 
decide that the choice of ContactOrientation=MailOrder determines the choice 
of PurchaseInitiationThrough=AND(Internet; Letter/Fax).

2.3 Construction 

During the Model Construction phase, the configurable reference model has to be
developed with regard to the decisions made during the preceding phase Project Aim
Definition. The example in Fig. 3 illustrates an EPC regarding the payment of a
bill, distinguishing whether the bill originates from a national or an international
source. If the origin of the bill is national, it can be paid immediately, otherwise it
has to be cross-checked by the international auditing. This scenario can only take
place if both instances of the characteristic TradingLevel, namely InlandTrade
and ForeignTrade, are chosen. If all clients of a company are settled abroad or (in
the meaning of an exclusive or) all of them are inland, the check for the origin is
not necessary. The cross-check with the international auditing only has to take place
if the bill comes from abroad. To store this information in the model, the
respective parameters are attached to the respective model elements in the form of a term
that can later be evaluated to true or false. Only if the term evaluates to true or
if there is no term attached to an element may the respective element remain in the
configured model. Thus, for example, the function check for origin stays if the term
TradingLevel=AND(Foreign; Inland) is true, which happens if both parameters
are selected. If only one is selected, the term evaluates to false and the element will
be removed from the model.
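The evaluation of attached terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the functions `term_and`, `term_is` and `configure`, the representation of a selection as a dictionary of sets, and the element names are hypothetical and do not describe the term editor or the tool set used by the authors.

```python
def term_and(characteristic, *required):
    """Term that is true only if all required parameters are selected."""
    def evaluate(selection):
        return set(required) <= selection.get(characteristic, set())
    return evaluate

def term_is(characteristic, parameter):
    """Term that is true if the given parameter is among those selected."""
    return term_and(characteristic, parameter)

def configure(elements, selection):
    """Keep elements whose term is true or that carry no term at all."""
    return [name for name, term in elements
            if term is None or term(selection)]

# Elements of the bill-payment process; the element names are illustrative.
elements = [
    ("bill received", None),
    ("check for origin", term_and("TradingLevel", "Foreign", "Inland")),
    ("cross-check with international auditing",
     term_is("TradingLevel", "Foreign")),
    ("pay bill", None),
]

both = configure(elements, {"TradingLevel": {"Foreign", "Inland"}})
inland_only = configure(elements, {"TradingLevel": {"Inland"}})
foreign_only = configure(elements, {"TradingLevel": {"Foreign"}})
```

With both trading levels selected all four elements remain; with only Inland selected the check for origin and the cross-check are removed; with only Foreign selected the cross-check stays but the check for origin is removed.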





























[Figure panels: Configurable Reference Model; configured models for the parameters AND(Foreign; Inland), Foreign, and Inland]

Fig. 3. Annotated parameters to elements, resulting model variants



To specify these terms, which can get complex if many characteristics are used, a
term editor application has been developed, which enables the user to attach them
to the relevant elements. Here again, the ontology can support the developer by
automatically testing for correctness and reasonableness of dependent parameters
(see Knackstedt et al. (2006)). In contrast to dependencies, exclusions take into
account that under certain circumstances parameters may not be chosen together. This
minimises the risk of defective modelling and raises the consistency level of the
configurable reference model. In the example given above, if the developer selects
SalesContactForm=VendingMachine, the parameter Beneficiary may not be
InvestmentGoodsTrade, as investment goods can hardly be bought via a
vending machine. Thus, the occurrence of both statements concatenated with a logical
AND is not allowed. The same has to be regarded when evaluating dependencies:
If, as stated above, ContactOrientation=MailOrder determines the choice of
PurchaseInitiationThrough=AND(Internet; Letter/Fax), the same statement
may not occur with a preceding NOT. Again, the previously generated taxonomy can
support the developer by structuring the included variants.
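A check of a parameter selection against dependencies and exclusions might look like the following Python sketch. The rule encoding (pairs of characteristic and parameter) and the function name `violations` are assumptions for illustration only; the two rules are the mail-order dependency and the vending-machine exclusion mentioned in the text.

```python
def violations(selection, dependencies, exclusions):
    """Return a description of every rule the selection violates.

    selection    -- set of (characteristic, parameter) pairs
    dependencies -- list of (trigger, required_set): if the trigger is
                    selected, every pair in required_set must be selected
    exclusions   -- list of (first, second): the two pairs must not be
                    selected together
    """
    problems = []
    for trigger, required in dependencies:
        if trigger in selection and not required <= selection:
            problems.append("dependency: %s=%s" % trigger)
    for first, second in exclusions:
        if first in selection and second in selection:
            problems.append("exclusion: %s=%s with %s=%s" % (first + second))
    return problems

dependencies = [
    (("ContactOrientation", "MailOrder"),
     {("PurchaseInitiationThrough", "Internet"),
      ("PurchaseInitiationThrough", "Letter/Fax")}),
]
exclusions = [
    (("SalesContactForm", "VendingMachine"),
     ("Beneficiary", "InvestmentGoodsTrade")),
]

consistent = violations(
    {("ContactOrientation", "MailOrder"),
     ("PurchaseInitiationThrough", "Internet"),
     ("PurchaseInitiationThrough", "Letter/Fax")},
    dependencies, exclusions)
incomplete = violations(
    {("ContactOrientation", "MailOrder"),
     ("PurchaseInitiationThrough", "Internet")},
    dependencies, exclusions)
clashing = violations(
    {("SalesContactForm", "VendingMachine"),
     ("Beneficiary", "InvestmentGoodsTrade")},
    dependencies, exclusions)
```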

2.4 Configuration 

The Usage phase of a configurable reference model starts independently from its
development. During the Project Aim Definition phase the potential user defines the
parameters to determine which reference model best meets his needs. He has to search
for it during the Search and Selection phase. Once the user has selected a certain
configurable reference model, he uses its taxonomy to pick the parameters relevant
to his purpose. By automatically including dependent parameters, the ontology can
be of assistance in the same way as before, assuring that the mistakes made by the
user are reduced to a minimum (see Knackstedt et al. (2006)). For each parameter
- or set of parameters - a certain model variant is created. These variants have to
be differentiated by the aim of the configuration. On the one hand, the user might
want to configure a model that cannot be further adapted. This happens if a
maximum of one parameter per characteristic is chosen. In this case, the ontology has to
consider dependencies as well as exclusions. On the other hand, if the user decides to
configure towards a model variant that should be configured again, exclusions may
not be considered. Both possibilities have to be covered by the ontology.
Furthermore, a validation should cross-check against the ontology that no terms exist that
always evaluate to false. If an element is removed in every configuration scenario, it
should not have been integrated into the reference model in the first place. Thus, the
taxonomy can assist the user during the configuration phase by offering a set of
parameters to choose from. Combined with an underlying ontology, the possibility of
making mistakes by using the taxonomy during the model adaptation is reduced to a
minimum.
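The validation that no term always evaluates to false can be sketched by exhaustively enumerating the selections a taxonomy allows. The Python below is an illustration under assumptions: the names `subsets`, `all_selections` and `always_false`, the small two-characteristic taxonomy, and the `admissible` predicate standing in for the ontology are all hypothetical. Exhaustive enumeration is only feasible for small taxonomies.

```python
from itertools import chain, combinations, product

def subsets(parameters):
    """All subsets of a characteristic's parameters, including the empty one."""
    return chain.from_iterable(
        combinations(parameters, r) for r in range(len(parameters) + 1))

def all_selections(taxonomy):
    """Yield every selection as a dict: characteristic -> set of parameters."""
    names = list(taxonomy)
    for combo in product(*(list(subsets(taxonomy[n])) for n in names)):
        yield {n: set(chosen) for n, chosen in zip(names, combo)}

def always_false(term, taxonomy, admissible=lambda selection: True):
    """True if the term is false for every admissible selection."""
    return not any(term(s) for s in all_selections(taxonomy) if admissible(s))

taxonomy = {
    "SalesContactForm": ["SelfService", "Catalog", "VendingMachine"],
    "Beneficiary": ["InvestmentGoodsTrade", "ConsumerGoodsTrade"],
}

def vending(s):
    return "VendingMachine" in s["SalesContactForm"]

def vending_and_investment(s):
    return vending(s) and "InvestmentGoodsTrade" in s["Beneficiary"]

def admissible(s):
    # Exclusion rule: vending machines and investment goods never co-occur.
    return not vending_and_investment(s)

dead_term = always_false(vending_and_investment, taxonomy, admissible)
live_term = always_false(vending, taxonomy, admissible)
without_exclusions = always_false(vending_and_investment, taxonomy)
```

Here `dead_term` is true because the exclusion rule rules out every selection that would satisfy the term, whereas the same term is satisfiable when exclusions are not considered.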



3 Conclusion 

As well as the ontology, the taxonomy used as a basic element throughout the phases
of Configurative Reference Modelling has to meet certain demands. Most
importantly, the developers have to carefully select the constituting characteristics and
associated parameters. It has to be possible for the user to distinguish between several
options, so that he can make a clear decision to configure the model towards the variant
relevant for his purpose. This means that each parameter has to be understandable
and be delimited from the others, which - for example - can be achieved by
supplying a manual or guide. Moreover, the parameters may neither be too abstract nor too
detailed. The taxonomy can be of use during the three relevant phases. As mentioned
before, the user has to be assisted in the usage of the taxonomy by automatically
including or excluding parameters as defined by the ontology. Furthermore, only such
parameters should be chosen that have an effect on the model that is commensurate
with the effort necessary to identify them. Parameters that have no effect at all or are
not used should be removed as well, to decrease the complexity for both the developer
and the user. If the choice of a parameter results in the removal of only one element
and its identification takes a very long time, it should be removed from the
taxonomy because of its little effect at high costs. Thus, the way the adaptation process is
supported by the taxonomy strongly depends on the associated ontology.
4 Outlook

The resulting effect of the selection of one parameter to configure the model shows its 
relevance and can be measured either by the quantity or by the importance of the el- 
ements that are being removed. Each parameter can be associated with a certain cost 
that emerges due to the time it takes the user to identify it. Thus, cheap parameters are 
easy to identify and have a huge effect once selected. Expensive parameters instead 
are hard to identify and have little effect on the model. Further research should first 
try to benchmark, which combinations of parameters of a certain reference model are 
chosen most often. In doing so, the developer has the chance to concentrate on the 
evolution of these parts of the reference model. Second, it should be possible to iden- 
tify cheap parameters by either running simulations on reference models, measuring 
the effect a parameter has - even in combination with other parameters -, or by au- 
diting the behavior of reference model users - which is feasible in a limited way due 
to the small distribution of configurable reference models. Third, configured models 
should be rated with costs, so cheap variants can be identified and - the other way 
round - the responsible parameters can be identified. To sum up, an objective function
should be developed, enabling the calculation of the costs for the configuration of a
certain model variant in advance by giving the selected parameters as input. It should
have the form $C(MV) = \sum_{k=1}^{n} C(P_k) \cdot R(P_k)$, with $C(MV)$ being the cost
function of a certain model variant derived from the reference model by using $n$
parameters, $C(P_k)$ being the cost function of a single parameter and $R(P_k)$ being a
function weighting the relevance of a single parameter $P_k$, which is used for the
configuration of the respective model variant. Furthermore, the usefulness of the
application of the taxonomy has to
be evaluated by empirical studies in every day business. This will be realised for the 
configuration phase by integrating consultancies into our research and giving them a 
taxonomy for a certain domain at hand. With the application of supporting software 
tools, we hope that the adoption process of the reference model can be facilitated. 
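The proposed objective function can be illustrated with a small Python sketch. It assumes that the cost of a model variant is the sum over the selected parameters of the parameter cost multiplied by its relevance weight; this multiplicative weighting, the function name `configuration_cost` and all numbers are assumptions made for the example, not values from the paper.

```python
def configuration_cost(selected, cost, relevance):
    """Relevance-weighted sum of parameter costs for one model variant."""
    return sum(cost[p] * relevance[p] for p in selected)

# Hypothetical identification costs and relevance weights per parameter.
cost = {"Inland": 2.0, "Foreign": 4.0, "MailOrder": 8.0}
relevance = {"Inland": 0.5, "Foreign": 0.25, "MailOrder": 0.5}

variant_cost = configuration_cost(["Inland", "Foreign"], cost, relevance)
```

A function of this shape would allow variants to be compared before configuration, for instance by ranking candidate parameter sets by their computed cost.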



References 

BECKER, J., DELFMANN, P. and KNACKSTEDT, R. (2004): Konstruktion von
Referenzmodellierungssprachen - Ein Ordnungsrahmen zur Spezifikation von
Adaptionsmechanismen fuer Informationsmodelle. Wirtschaftsinformatik, 46, 4, 251-264.

BECKER, J., UHR, W. and VERING, O. (2001): Retail Information Systems Based on SAP
Products. Springer Verlag, Berlin, Heidelberg, New York.

BRAUN, R. and ESSWEIN, W. (2006): Classification of Reference Models. In: Advances
in Data Analysis: Proceedings of the 30th Annual Conference of The Gesellschaft fuer
Klassifikation e.V., Freie Universitaet Berlin, March 8-10, 2006.

DELFMANN, P., JANIESCH, C., KNACKSTEDT, R., RIEKE, T. and SEIDEL, S. (2006):
Towards Tool Support for Configurative Reference Modelling - Experiences from a Meta
Modeling Teaching Case. In: Proceedings of the 2nd Workshop on Meta-Modelling and
Ontologies (WoMM 2006). Lecture Notes in Informatics. Karlsruhe, Germany, 61-83.

FETTKE, P. and LOOS, P. (2004): Referenzmodellierungsforschung. Wirtschaftsinformatik,
46, 5, 331-340.







KNACKSTEDT, R. (2006): Fachkonzeptionelle Referenzmodellierung einer
Managementunterstuetzung mit quantitativen und qualitativen Daten. Methodische Konzepte zur
Konstruktion und Anwendung. Logos-Verlag, Berlin.

KNACKSTEDT, R., SEIDEL, S. and JANIESCH, C. (2006): Konfigurative
Referenzmodellierung zur Fachkonzeption von Data-Warehouse-Systemen mit dem H2-Toolset. In: J.
Schelp, R. Winter, U. Frank, B. Rieger, K. Turowski (Hrsg.): Integration,
Informationslogistik und Architektur. DW2006, 21.-22. Sept. 2006, Friedrichshafen. Lecture Notes
in Informatics. Bonn, Germany, 61-81.

MERTENS, P. and LOHMANN, M. (2000): Branche oder Betriebstyp als
Klassifikationskriterien fuer die Standardsoftware der Zukunft? Erste Ueberlegungen, wie kuenftig
betriebswirtschaftliche Standardsoftware entstehen koennte. In: F. Bodendorf, M. Grauer
(Hrsg.): Verbundtagung Wirtschaftsinformatik 2000. Shaker Verlag, Aachen, 110-135.

SCHLAGHECK, B. (2000): Objektorientierte Referenzmodelle fuer das Prozess- und
Projektcontrolling. Grundlagen - Konstruktion - Anwendungsmoeglichkeiten. Deutscher
Universitaets-Verlag, Wiesbaden.

SCHUETTE, R. (1998): Grundsaetze ordnungsmaessiger Referenzmodellierung.
Konstruktion konfigurations- und anpassungsorientierter Modelle. Deutscher
Universitaets-Verlag, Wiesbaden.

VOM BROCKE, J. (2003): Referenzmodellierung. Gestaltung und Verteilung von
Konstruktionsprozessen. Logos Verlag, Berlin.




Two-Dimensional Centrality of a Social Network 



Akinori Okada 

Graduate School of Management and Information Sciences 

Tama University, 4-1-1 Hijirigaoka Tama-shi, Tokyo 206-0022, Japan 

okada@tama.ac.jp 



Abstract. A procedure of deriving the centrality in a social network is presented. The pro- 
cedure uses the characteristic values and the vectors of a matrix of friendship relationships 
among actors. While the centrality of an actor has been usually derived by the characteristic 
vector corresponding to the largest characteristic value, the present study uses not only the 
characteristic vector corresponding to the largest characteristic value but also that correspond- 
ing to the second largest characteristic value. Each actor has two centralities. The interpretation 
of two centralities, and the comparison with the additive clustering are presented. 



1 Introduction 

When we have a symmetric social network among a set of actors, where the relationship
from actor j to actor k is equal to the relationship from actor k to actor j, the
centrality of each actor who constitutes the social network is very important for finding
the features and the structure of the social network. The centrality of an actor
represents the importance, significance, power, or popularity of the actor to form
relationships with the other actors in the social network. Several procedures to derive
the centrality of each actor in the social network have been introduced (e.g. Hubbell
(1965)). Bonacich (1972) introduced a procedure to derive the centrality of an actor by
using the characteristic (eigen) vector of a matrix of friendship relationships or
friendship choices among a set of actors. The matrix of friendship relationships which
is dealt with by these procedures is assumed to be symmetric.

The procedure of Bonacich (1972) is based on the characteristic vector corresponding
to the largest characteristic (eigen) value. Each element of the characteristic
vector represents the centrality of each actor. The procedure has one good property:
the centrality of an actor is defined recursively by the weighted sum of the
centralities of all actors, where the weight is the strength of the friendship relationship
between the actor and the other actors. The procedure was extended to deal with
an asymmetric matrix of friendship relationships (Bonacich (1991)), where (a) the
relationship from actor j to actor k is not the same as that from actor k to actor j, or
(b) the relationships are between a set of actors and another set of actors. The first
case (a) means one-mode two-way data, and the second case (b) means two-mode two-way
data. These procedures utilize the characteristic vector which corresponds to the
largest characteristic value. Wright and Evitts (1961) also introduced a procedure to
derive the centrality of an actor utilizing the characteristic vectors which correspond
to more than one (largest) characteristic value. While Wright and Evitts (1961) say
the purpose is to derive the centrality, they focus their attention on summarizing the
relationships among actors, just like applying factor analysis to the matrix of friendship
relationships.
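The recursive definition used by Bonacich (1972) can be written compactly. In the following sketch the symbols $c_j$ (centrality of actor $j$), $a_{jk}$ (strength of the relationship between actors $j$ and $k$) and $\lambda$ (largest characteristic value) are notation assumed here for illustration:

```latex
\lambda\, c_j = \sum_{k=1}^{n} a_{jk}\, c_k \quad (j = 1, \dots, n),
\qquad\text{i.e.}\qquad A\,c = \lambda\, c .
```

Read this way, the vector of centralities $c$ is a characteristic vector of the relationship matrix $A$, which is why the centrality can be obtained from a characteristic decomposition.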

The purpose of the present study is to introduce a procedure to derive the
centrality of each actor of a social network by using the characteristic vectors which
correspond to more than one of the largest characteristic values of the matrix of
friendship relationships. Although the present procedure is based on more than one
characteristic vector, the purpose is to derive the centrality of actors and not to
summarize relationships among actors in a social network.



2 The procedure 

The present procedure deals with a symmetric matrix of friendship relationships.
Suppose we are dealing with a social network consisting of n actors. Let A be an
n x n matrix representing the friendship relationships among the actors in the
social network. The (j,k) element of A, a_{jk}, represents the relationship between
actors j and k; when actors j and k are friends with each other,

    a_{jk} = 1,                                                        (1)

and when actors j and k are not friends with each other,

    a_{jk} = 0.                                                        (2)

Because the relationships among actors are symmetric, the matrix A is symmetric:
a_{jk} = a_{kj}.

The characteristic vectors of the n x n matrix A which correspond to the two largest
characteristic values are derived. Each characteristic value represents the salience
of the centrality represented by the corresponding characteristic vector. The jth
element of a characteristic vector represents the centrality of actor j along the
feature or aspect represented by that characteristic vector.
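
The extraction step described above can be sketched in a few lines of Python. The
sketch below uses shifted power iteration with deflation, a technique chosen here
only for illustration; the text does not say which numerical method was actually
used. All function names are hypothetical, and the sketch assumes that the two
largest characteristic values are distinct.

```python
import math


def matvec(a, v):
    """Multiply the square matrix a (a list of rows) by the vector v."""
    return [sum(a_jk * v_k for a_jk, v_k in zip(row, v)) for row in a]


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def power_iteration(b, iterations=2000):
    """Dominant eigenpair of a symmetric matrix b whose eigenvalues are
    all non-negative (the caller is assumed to guarantee this)."""
    v = normalize([1.0 + 0.01 * i for i in range(len(b))])
    for _ in range(iterations):
        v = normalize(matvec(b, v))
    value = sum(x * y for x, y in zip(v, matvec(b, v)))  # Rayleigh quotient
    return value, v


def two_leading_eigenpairs(a):
    """Two largest characteristic values of symmetric a, with unit vectors."""
    n = len(a)
    # Gershgorin bound: every eigenvalue of a is >= -shift, so a + shift * I
    # has only non-negative eigenvalues and the largest one is dominant.
    shift = max(sum(abs(x) for x in row) for row in a)
    b = [[a[j][k] + (shift if j == k else 0.0) for k in range(n)]
         for j in range(n)]
    mu1, v1 = power_iteration(b)
    # Deflation: remove the first eigenpair so that the second one dominates.
    deflated = [[b[j][k] - mu1 * v1[j] * v1[k] for k in range(n)]
                for j in range(n)]
    mu2, v2 = power_iteration(deflated)
    return (mu1 - shift, v1), (mu2 - shift, v2)


# Toy network of three actors (not the data of this study): actor 0 is tied
# to actors 1 and 2, and the diagonal elements are set to 1.
toy = [[1, 1, 1],
       [1, 1, 0],
       [1, 0, 1]]
(first_value, first_vector), (second_value, second_vector) = \
    two_leading_eigenpairs(toy)
```

A characteristic vector is determined only up to its sign, so the signs returned
for the second vector depend on the starting vector; only the pattern of equal and
opposite signs carries meaning.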



3 The analysis and the result 

In the present study, the social network data among 16 families were analyzed
(Wasserman and Faust (1994, p. 744, Table B 6)). The data show the marital
relationships among 16 families. Thus the actor in the present data is the family.
The relationships are represented by a 16 x 16 matrix. Each element represents
whether there was a marital tie between the two families corresponding to a row and
a column (Wasserman and Faust (1994, p. 62)). The (j,k) element of the matrix is
equal to 1 when there is a marital tie between families j and k, and is equal to 0
when there is no marital tie between families j and k. In the present analysis, the
diagonal elements of the matrix of friendship relationships were set to unity.

The five largest characteristic values of the 16 x 16 friendship relationship matrix
were 4.233, 3.418, 2.704, 2.007, and 1.930. The characteristic vectors corresponding
to the two largest characteristic values are shown in the second and the third
columns of Table 1.



Table 1. Characteristic vectors

Actor (Family)          Dimension 1   Dimension 2
Characteristic values      4.233         3.418

 1 Acciaiuoli              0.129         0.134
 2 Albizzi                 0.210         0.300
 3 Barbadori               0.179         0.053
 4 Bischeri                0.328        -0.260
 5 Castellani              0.296        -0.353
 6 Ginori                  0.094         0.123
 7 Guadagni                0.283         0.166
 8 Lamberteschi            0.086         0.076
 9 Medici                  0.383         0.434
10 Pazzi                   0.039         0.117
11 Peruzzi                 0.339        -0.385
12 Pucci                   0.000         0.000
13 Ridolfi                 0.301         0.124
14 Salviati                0.137         0.236
15 Strozzi                 0.404        -0.382
16 Tornabuoni              0.281         0.285


The two characteristic values are 4.233 and 3.418, each of which represents the
relative salience of the centrality over all 16 actors along the feature or aspect
shown by the corresponding characteristic vector. The two centralities represent two
different features or aspects, called Dimensions 1 and 2 (see Figure 1), of the
importance, significance, power, or popularity of actors. The second column, which
represents the characteristic vector corresponding to the largest characteristic
value, has only non-negative elements. These figures show the centrality of the 16
actors along the feature or aspect of Dimension 1. A larger value indicates a larger
centrality of an actor. Actor 15 has the largest value, 0.404, and thus the largest
centrality among the 16 actors. Actors 4, 9, 11, and 13 have large centralities as
well. Actor 12 has the smallest value, 0.000, and thus the smallest centrality among
the 16 actors. Actors 6, 8, and 10 also have small centralities.

The third column represents the characteristic vector corresponding to the second
largest characteristic value. While the characteristic vector corresponding to the
largest characteristic value, shown in the second column, has only non-negative
elements, the characteristic vector corresponding to the second largest
characteristic value also has negative elements. Actors 2 and 9 have large positive
elements. In contrast, actors 4, 5, 11, and 15 have substantial negative elements.
The meaning and the interpretation of the characteristic vector which corresponds to
the second largest characteristic value are discussed in the next section.



4 Discussion 

The two characteristic vectors corresponding to the largest and the second largest
characteristic values represent the centralities of each actor along two different
features or aspects, Dimensions 1 and 2. The 16 elements of the first characteristic
vector seem to represent the overall (global) centrality or popularity of an actor
among the actors in the social network (cf. Scott (1991, pp. 85-89)). For each
actor, the number of ties with the other 15 actors was calculated. Each of these 16
figures shows the overall centrality or popularity of the actor among the actors in
the social network. The correlation coefficient between the elements of the first
characteristic vector and these figures was 0.90. This indicates that the elements
of the first characteristic vector show the overall centrality or popularity of the
actor in the social network. This is the meaning of the feature or aspect given by
the first characteristic vector, Dimension 1.
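
The check described above (counting ties and correlating the counts with the first
characteristic vector) can be sketched as follows. The data are a toy network of
three actors, not the 16 families, so the resulting coefficient says nothing about
the 0.90 reported above; the function names are hypothetical.

```python
import math


def degrees(a):
    """Number of ties of each actor, ignoring the diagonal of a."""
    n = len(a)
    return [sum(a[j][k] for k in range(n) if k != j) for j in range(n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    var_x = sum((p - mean_x) ** 2 for p in x)
    var_y = sum((q - mean_y) ** 2 for q in y)
    return cov / math.sqrt(var_x * var_y)


# Toy network: actor 0 is tied to actors 1 and 2 (unity on the diagonal).
toy = [[1, 1, 1],
       [1, 1, 0],
       [1, 0, 1]]
# First characteristic vector of toy, known in closed form: (sqrt(2), 1, 1) / 2.
first_vector = [math.sqrt(2) / 2, 0.5, 0.5]

toy_degrees = degrees(toy)
toy_correlation = pearson(first_vector, toy_degrees)
```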

The jth element of the first characteristic vector shows the strength of actor j
in extending or accepting friendship relationships with the other actors in the
social network as a whole. The strength of the friendship relationship between
actors j and k along Dimension 1 is represented by the product of the jth and the
kth elements of the first characteristic vector. Because all elements of the first
characteristic vector are non-negative, the product of any two of its elements is
non-negative. The larger the product, the stronger the tie between the two actors.

The second characteristic vector has negative elements as well as positive
(non-negative) ones. Thus, there are three cases for the product of two elements of
the second characteristic vector:

(a) the product of two non-negative elements is non-negative,

(b) the product of two negative elements is positive, and

(c) the product of a positive element and a negative element is negative.

In case (a), the interpretation of the elements of the second characteristic vector
is the same as that of the first characteristic vector. In cases (b) and (c),
however, it is difficult to interpret the elements in the same manner as in case
(a). Because the elements of the matrix of friendship relationships were defined by
Equations (1) and (2), a larger or positive value of the product of two elements of
the second characteristic vector indicates a stronger or positive friendship
relationship between the two corresponding actors, and a smaller or negative value
indicates a weaker or negative (friendship) relationship between them. The product
of two negative elements of the second characteristic vector is positive, and this
positive figure indicates a positive friendship relationship between the two actors.
The product of a positive and a negative element is negative, and this negative
figure indicates a negative friendship relationship between the two actors.

The feature or aspect represented by the second characteristic vector can be
regarded as the local centrality or popularity within a subgroup (cf. Scott (1991,
pp. 85-89)). As shown in Table 1, some actors have positive and some actors have
negative elements on Dimension 2, the second characteristic vector. We can consider
that there are two subgroups of actors: one subgroup consists of the actors having
positive elements of the second characteristic vector, the other consists of those
having negative elements, and the two subgroups are not friendly with each other.
When two actors belong to the same subgroup, the product of the two corresponding
elements of the second characteristic vector is positive (cases (a) and (b) above),
suggesting a positive friendship relationship between the two actors. On the other
hand, when two actors belong to different subgroups, which means that one actor has
a positive element and the other has a negative element, the product of the two
corresponding elements is negative (case (c) above), suggesting a negative
friendship relationship between the two actors.

Table 1 shows that actors 4, 5, 11, and 15 have negative elements on the second
characteristic vector. This means that the second characteristic vector suggests two
subgroups of actors:

Subgroup 1: actors 1, 2, 3, 6, 7, 8, 9, 10, (12), 13, 14, and 16
Subgroup 2: actors 4, 5, 11, and 15
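
The sign-based split can be reproduced directly from the Dimension 2 column of
Table 1. The following Python sketch is only an illustration; the function names
are hypothetical, and assigning the zero element of actor 12 to the non-negative
group mirrors the parenthesized (12) in the list above.

```python
# Second characteristic vector (Dimension 2) as reported in Table 1,
# keyed by actor number.
dimension_2 = {
    1: 0.134, 2: 0.300, 3: 0.053, 4: -0.260, 5: -0.353, 6: 0.123,
    7: 0.166, 8: 0.076, 9: 0.434, 10: 0.117, 11: -0.385, 12: 0.000,
    13: 0.124, 14: 0.236, 15: -0.382, 16: 0.285,
}


def split_by_sign(vector):
    """Return (actors with non-negative elements, actors with negative ones)."""
    non_negative = sorted(j for j, x in vector.items() if x >= 0)
    negative = sorted(j for j, x in vector.items() if x < 0)
    return non_negative, negative


def tie_sign(vector, j, k):
    """Sign (+1, 0, or -1) of the product of the elements of actors j and k."""
    product = vector[j] * vector[k]
    return (product > 0) - (product < 0)


subgroup_1, subgroup_2 = split_by_sign(dimension_2)
```

For instance, `tie_sign(dimension_2, 11, 15)` is positive (both actors are in
subgroup 2, case (b)), while `tie_sign(dimension_2, 9, 15)` is negative (case (c)).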

The two subgroups are shown graphically in Figure 1, where the horizontal dimension
(Dimension 1) corresponds to the first characteristic vector, and the vertical
dimension (Dimension 2) corresponds to the second characteristic vector. Each actor
is represented as a point whose coordinate on Dimension 1 is the corresponding
element of the first characteristic vector and whose coordinate on Dimension 2 is
the corresponding element of the second characteristic vector. Figure 1 shows that
the four members of the second subgroup are located close to each other and are
separated from the other 12 actors. This seems to validate the interpretation of the
feature or aspect represented by the second characteristic vector.

The sign (positive or negative) of an element of the second characteristic vector
indicates to which subgroup the actor belongs. The element also represents the
centrality of the actor within the subgroup to which the actor belongs, because the
product of the two elements corresponding to two actors of the same subgroup is
positive regardless of the sign of the elements. The absolute value of the element
indicates the local centrality or popularity of the actor among the actors in its
own subgroup, and the degree of periphery or unpopularity among the actors in the
other subgroup. For each actor, the number of ties with actors in the same subgroup
was calculated. The correlation coefficient between the absolute values of the
elements of the second characteristic vector and the number of ties within a
subgroup was 0.85. This indicates that the absolute values of the elements of the
second characteristic vector show the centrality of an actor in each of the two
subgroups. Because the correlation coefficient was derived over the two subgroups,
the centralities can be compared between subgroups 1 and 2.



[Figure 1: scatter plot of the 16 families. Horizontal axis: Dimension 1 (first
characteristic vector); vertical axis: Dimension 2 (second characteristic vector).]

Fig. 1. Two-dimensional configuration of 16 families



The interpretation of the feature or aspect of the second characteristic vector
is reminiscent of the ADCLUS model (Arabie and Carroll (1980); Arabie, Carroll, and
DeSarbo (1987); Shepard and Arabie (1979)). In the ADCLUS model, each object can
belong to more than one cluster, and each cluster has its own weight which shows the
salience of that cluster. Table 2 shows the result of applying ADCLUS to the present
friendship relationships data.



Table 2. Result of the ADCLUS analysis

Cluster    Weight   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Cluster 1    1.88   0  0  0  1  1  0  0  0  0  0  1  0  0  0  1  0
Universal   -0.09   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1


In Table 2, the second row shows whether each of the 16 actors belongs to cluster 1
(when the element is 1) or does not belong to cluster 1 (when the element is 0). The
third row represents the universal cluster, to which all actors belong, representing
the additive constant of the data (Arabie, Carroll, and DeSarbo (1987, p. 58)). As
shown in Table 2, actors 4, 5, 11, and 15 belong to cluster 1. These four actors
coincide with those having negative elements of the second characteristic vector in
Table 1.

The result derived by the analysis using ADCLUS and the result derived by using the
characteristic values and vectors are very similar, but they differ in several
respects. In the ADCLUS result, the strength of the friendship relationship between
two actors is represented as the sum of two terms: (a) the weight of the universal
cluster, and (b) the weight of cluster 1 if both actors belong to cluster 1. The
first term is constant for all pairs of actors, and the second term is either the
weight of cluster 1 (when both actors belong to cluster 1) or zero (when only one or
neither of the two actors belongs to cluster 1). When the characteristic vectors are
used, the strength of the friendship relationship between two actors is also
represented as the sum of two terms: (a) the product of the two elements of the
first characteristic vector, and (b) the product of the two elements of the second
characteristic vector. Neither term is constant over pairs of actors; each pair has
its own value, because each actor has its own elements on the first and the second
characteristic vectors. The first and the second characteristic vectors are
orthogonal, because the matrix of friendship relationships is assumed to be
symmetric and the two characteristic values are different; in this sense the two
vectors are uncorrelated (their inner product is zero). The clusters derived by
ADCLUS do not have this property, even if two or more clusters are derived.
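
The two representations compared above can be written down side by side using only
the values reported in Tables 1 and 2. The sketch below follows the wording of the
text literally (plain products of vector elements, without weighting by the
characteristic values); the function names are hypothetical.

```python
# Values copied from Table 1 (characteristic vectors) and Table 2 (ADCLUS).
dimension_1 = [0.129, 0.210, 0.179, 0.328, 0.296, 0.094, 0.283, 0.086,
               0.383, 0.039, 0.339, 0.000, 0.301, 0.137, 0.404, 0.281]
dimension_2 = [0.134, 0.300, 0.053, -0.260, -0.353, 0.123, 0.166, 0.076,
               0.434, 0.117, -0.385, 0.000, 0.124, 0.236, -0.382, 0.285]
cluster_1 = {4, 5, 11, 15}      # members of ADCLUS cluster 1
cluster_1_weight = 1.88
universal_weight = -0.09        # additive constant


def adclus_strength(j, k):
    """Universal weight, plus the cluster weight if both actors are members."""
    both_members = j in cluster_1 and k in cluster_1
    return universal_weight + (cluster_1_weight if both_members else 0.0)


def vector_strength(j, k):
    """Sum of the element products on Dimensions 1 and 2 (actors are 1-based)."""
    return (dimension_1[j - 1] * dimension_1[k - 1]
            + dimension_2[j - 1] * dimension_2[k - 1])


def inner_product(x, y):
    """Inner product of two equally long vectors."""
    return sum(p * q for p, q in zip(x, y))
```

ADCLUS yields only two distinct strengths here (one for pairs inside cluster 1, one
for all other pairs), whereas `vector_strength` gives every pair its own value. The
inner product of the two tabulated vectors is close to zero up to rounding of the
three-decimal entries, in line with their orthogonality.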

In the present analysis only one cluster was derived by ADCLUS. It would be
interesting to compare an ADCLUS result having more than one cluster with a result
based on the characteristic vectors corresponding to the third largest and further
characteristic values. Comparisons of the present procedure with concepts used in
graph theory seem necessary to evaluate the present procedure thoroughly. The
present procedure assumes that the strength of the friendship relationship between
actors j and k is represented by the product of the centralities of actors j and k.
Alternatively, the strength of the friendship relationship between two actors can be
defined as the sum of the centralities of the two actors by using conjoint
measurement (Okada (2003)). Whether the product or the sum of two centralities is
more easily understood, or more practical in applications, should be examined. The
original idea of centrality has been extended to asymmetric and rectangular social
networks (Bonacich (1991, 2001)). The present idea can also be extended rather
easily to deal with the asymmetric or the rectangular case.



Abstract. The two most popular classification tree algorithms in machine learning
and statistics, C4.5 and CART, are compared in a benchmark experiment together with
two other more recent constant-fit tree learners from the statistics literature
(QUEST, conditional inference trees). The study assesses both misclassification
error and model complexity on bootstrap replications of 18 different benchmark
datasets. It is carried out in the R system for statistical computing, made possible
by means of the RWeka package which interfaces R to the open-source machine learning
toolbox Weka. Both algorithms are found to be competitive in terms of
misclassification error, with the performance difference clearly varying across data
sets. However, C4.5 tends to grow larger and thus more complex trees.



1 Introduction 

Due to their intuitive interpretability, tree-based learners are a popular tool in
data mining for solving classification and regression problems. Traditionally,
practitioners with a machine learning background use the C4.5 algorithm (Quinlan,
1993) while statisticians prefer CART (Breiman, Friedman, Olshen and Stone, 1984).
One important reason for this is that free reference implementations have not been
easily available within an integrated computing environment. RPart, an open-source
implementation of CART, has been available for some time in the S/R package rpart
(Therneau and Atkinson, 1997), while the open-source implementation J4.8 for C4.5
became available more recently in the Weka machine learning package (Witten and
Frank, 2005) and is now accessible from within R by means of the RWeka package
(Hornik, Zeileis, Hothorn and Buchta, 2007). With these software tools available,
the algorithms can be easily compared and benchmarked on the same computing
platform: the R system for statistical computing (R Development Core Team, 2006).
The principal concern of this contribution is to provide a neutral and unprejudiced
review, especially taking into account classical beliefs (or preconceptions) about
performance differences between C4.5 and CART and heuristics for the choice of
hyper-parameters. With this in mind, we carry out a benchmark comparison, including
different strategies for hyper-parameter tuning as well as two further constant-fit
tree models: QUEST (Loh and Shih, 1997) and conditional inference trees (Hothorn,
Hornik and Zeileis, 2006). The learners are compared with respect to
misclassification error and model complexity on each of 18 different benchmarking
data sets by means of simultaneous confidence intervals (adjusted for multiple
testing). Across data sets, the performance is aggregated by consensus rankings.



2 Design of the benchmark experiment 

The simulation study includes a total of six tree-based methods for classification. 
All learners were trained and tested in the framework of Hothorn, Leisch, Zeileis 
and Hornik (2005) based on 500 bootstrap samples for each of 18 data sets. All 
algorithms are trained on each bootstrap sample and evaluated on the remaining out- 
of-bag observations. Misclassification rates are used as predictive performance mea- 
sures, while model complexity requirements of the algorithms under study are mea- 
sured by the number of estimated parameters (number of splits plus number of leafs). 
Performance and model complexity distributions are assessed for each algorithm on 
each of the datasets. In our setting, this results in 108 performance distributions (6 
algorithms on 18 data sets), each of size 500. For comparison on each individual 
data set, simultaneous pairwise confidence intervals (Tukey all-pair comparisons) 
are used. For aggregating the pairwise dominance relations across data sets, median 
linear order consensus rankings are employed following Hornik and Meyer (2007). A 
brief description of the algorithms and their corresponding implementation is given 
below. 
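
The resampling scheme described above (train on a bootstrap sample, evaluate on the
out-of-bag observations) can be sketched as follows. This is a minimal Python
illustration, not the R framework of Hothorn et al. (2005) used in the study; the
function names are hypothetical, and the majority-class learner is only a stand-in
for a tree learner.

```python
import random
from collections import Counter


def bootstrap_split(n, rng):
    """Draw n indices with replacement; the undrawn indices are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag


def majority_learner(train_x, train_y):
    """Stand-in learner: always predicts the most frequent training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return lambda x: label


def oob_misclassification(xs, ys, fit, replications=500, seed=0):
    """Out-of-bag misclassification rate for each bootstrap replication."""
    rng = random.Random(seed)
    rates = []
    for _ in range(replications):
        in_bag, out_of_bag = bootstrap_split(len(xs), rng)
        if not out_of_bag:  # no test observations left in this replication
            continue
        model = fit([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        errors = sum(model(xs[i]) != ys[i] for i in out_of_bag)
        rates.append(errors / len(out_of_bag))
    return rates
```

Calling `oob_misclassification(xs, ys, majority_learner)` returns one performance
distribution (up to 500 rates) for one learner on one data set; repeating this for
6 learners and 18 data sets gives the 108 distributions mentioned above.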

CART/RPart: Classification and regression trees (CART, Breiman et al., 1984) is the
classical recursive partitioning algorithm which is still the most widely used in
the statistics community. Here, we employ the open-source reference implementation
of Therneau and Atkinson (1997) provided in the R package rpart. For determining the
tree size, cost-complexity pruning is typically adopted, using either a 0- or a
1-standard-error rule. The former chooses the complexity parameter associated with
the smallest prediction error in cross-validation (RPart0), whereas the latter
chooses the highest complexity parameter which is within 1 standard error of the
best solution (RPart1).

C4.5/J4.8: C4.5 (Quinlan, 1993) is the predominantly used decision tree algorithm
in the machine learning community. Although source code implementing C4.5 is
available in Quinlan (1993), it is not published under an open-source license.
Therefore, the Java implementation of C4.5 (revision 8), called J4.8, in Weka is the
de-facto open-source reference implementation. For determining the tree size, a
heuristic confidence threshold C is typically used, which is by default set to
C = 0.25 (as recommended in Witten and Frank, 2005). To evaluate the influence of
this parameter, we compare the default J4.8 algorithm with a tuned version where C
and the minimal leaf size M (default: M = 2) are chosen by cross-validation
(J4.8(cv)). A full grid search for C = 0.01, 0.05, 0.1, ..., 0.5 and
M = 2, 3, ..., 10, 15, 20 is used in the cross-validation.

QUEST: Quick, unbiased and efficient statistical trees are a class of decision trees
suggested by Loh and Shih (1997) in the statistical literature. QUEST popularized
the concept of unbiased recursive partitioning, i.e., avoiding the variable
selection bias of exhaustive search algorithms (such as CART and C4.5). A binary
implementation is available from http://www.stat.wisc.edu/~loh/quest.html and
interfaced in the R package LohTools which is available from the authors upon
request.

CTree: Conditional inference trees (Hothorn et al., 2006) are a framework of
unbiased recursive partitioning based on permutation tests (i.e., conditional
inference) and applicable to inputs and outputs measured at arbitrary scale. An
open-source implementation is provided in the R package party.

The benchmarking datasets shown in Table 1 were taken from the popular UCI
repository of machine learning databases (Newman, Hettich, Blake and Merz, 1998)
as provided in the R package mlbench.

Table 1. Artificial [*] and non artificial benchmarking data sets

Data set                # of obs.   # of cat. inputs   # of num. inputs
breast cancer                 699                  9                  -
chess                        3196                 36                  -
circle *                     1000                  -                  2
credit                        690                  -                 24
heart                         303                  8                  5
hepatitis                     155                 13                  6
house votes 84                435                 16                  -
ionosphere                    351                  1                 32
liver                         345                  -                  6
Pima Indians diabetes         768                  -                  8
promotergene                  106                 57                  -
ringnorm *                   1000                  -                 20
sonar                         208                  -                 60
spirals *                    1000                  -                  2
threenorm *                  1000                  -                 20
tictactoe                     958                  9                  -
titanic                      2201                  3                  -
twonorm *                    1000                  -                 20
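
The 0- and 1-standard-error rules described for the RPart variants can be sketched
in Python as follows. This is an illustration of the selection rule as stated in the
text, not a call into rpart; the function name and the example rows are
hypothetical, and each row is assumed to hold a complexity parameter, its
cross-validated error, and the standard error of that error.

```python
def select_cp(cp_table, n_standard_errors):
    """Pick a complexity parameter from rows of (cp, cv_error, cv_std_error).

    n_standard_errors = 0 returns the cp with the smallest cross-validated
    error; n_standard_errors = 1 returns the largest cp whose error lies
    within one standard error of that minimum.
    """
    best = min(cp_table, key=lambda row: row[1])
    threshold = best[1] + n_standard_errors * best[2]
    eligible = [cp for cp, error, _ in cp_table if error <= threshold]
    return max(eligible)


# Hypothetical cross-validation results (not taken from the study).
example_table = [
    (0.200, 0.80, 0.05),
    (0.100, 0.60, 0.05),
    (0.050, 0.52, 0.04),
    (0.010, 0.50, 0.04),
    (0.001, 0.51, 0.04),
]
cp_zero_se = select_cp(example_table, 0)
cp_one_se = select_cp(example_table, 1)
```

Because a larger complexity parameter prunes more aggressively, the
1-standard-error choice never yields a larger tree than the 0-standard-error choice.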

3 Results of the benchmark experiment

3.1 Results on individual datasets: Pairwise confidence intervals 

Here, we exemplify — using the well-known Pima Indians diabetes and breast cancer 
data sets — how the tree algorithms are assessed on a single data set. Simultaneous 
confidence intervals are computed for all 15 pairwise comparisons of the 6 learners. 
The resulting dominance relations are used as the input for the aggregation analyses 
in Section 3.2. 



[Figure 1: four panels (Pima Indians Diabetes: Misclassification; Pima Indians
Diabetes: Complexity; Breast Cancer: Misclassification; Breast Cancer: Complexity),
each showing intervals for the 15 pairwise comparisons of the learners J4.8,
J4.8(cv), RPart0, RPart1, QUEST, and CTree.]

Fig. 1. Simultaneous confidence intervals of pairwise performance differences (left:
misclassification, right: complexity) for Pima Indians diabetes (top) and breast
cancer (bottom) data.



As can be seen from the performance plots for Pima Indians diabetes in Figure 1,
standard J4.8 is outperformed (in terms of misclassification as well as model
complexity) by the other tree learners. All other algorithm comparisons indicate
equal predictive performances, except for the comparison of RPart0 and J4.8(cv),
where the former learner performs slightly better than the latter. On this
particular dataset tuning enhances the predictive performance of J4.8, while the
misclassification rates of the differently tuned RPart versions are not subject to
significant changes. In terms of model complexity J4.8(cv) produces larger trees
than the other learners. Looking at the breast cancer data yields a rather different
picture: Both RPart versions are outperformed by J4.8 or its tuned alternative in
terms of predictive accuracy. Similar to Pima Indians diabetes, J4.8 and J4.8(cv)
tend to build significantly larger trees than RPart. On this dataset, CTree has a
slight advantage over all other algorithms except J4.8 in terms of predictive
accuracy. For J4.8 as well as RPart, tuning does not promise to increase predictive
accuracy significantly. A closer look at the differing behavior of J4.8(cv) under
cross validation for both data sets is provided in Figure 2. In contrast to the
breast cancer example, the results based on the Pima Indians diabetes dataset (on
which tuning of J4.8 caused a significant performance increase) show a considerable
difference in choice of parameters. The multiple inference results gained from all
datasets considered in this simulation experiment (just like the results derived
from the two datasets above) form the basis for the aggregation analyses of
Section 3.2.

[Figure 2: distributions of the cross-validated J4.8(cv) parameters; recoverable
labels: panel title "Breast Cancer", axis "confidence threshold (C)".]

Fig. 2. Distribution of J4.8(cv) parameters obtained through cross validation on
Pima Indians diabetes and breast cancer data sets.

3.2 Results across data sets: Consensus Rankings 

With 18 x 6 = 108 performance distributions of the 6 different learners applied to
18 data settings at hand, aggregation methods help to summarize and compare
algorithmic performance. The underlying dominance relations derived from the
multiple testing are summarized by simple sums in Table 2 and by the corresponding
median linear order rankings in Table 3. In Table 2, rows refer to winners, while
columns denote the losers. For example, J4.8 managed to outperform QUEST on 11
datasets, while QUEST outperformed J4.8 on 4; on the remaining 3 datasets, J4.8 and
QUEST perform equally well.
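
The way the counts of Table 2 are read can be made concrete with a short Python
sketch that uses the misclassification panel of Table 2. The variable and function
names are hypothetical; the numbers are those of the table.

```python
learners = ["J4.8", "J4.8(cv)", "RPart0", "RPart1", "QUEST", "CTree"]
# Misclassification panel of Table 2: wins[i][j] is the number of data sets
# on which learners[i] significantly outperforms learners[j].
wins = [
    [0, 2, 9, 9, 11, 8],
    [4, 0, 8, 9, 11, 9],
    [5, 6, 0, 7, 10, 7],
    [6, 4, 1, 0, 8, 6],
    [4, 2, 2, 5, 0, 7],
    [7, 6, 7, 8, 9, 0],
]
N_DATASETS = 18

# Row sums count wins, column sums count losses (the margins of Table 2).
total_wins = {name: sum(row) for name, row in zip(learners, wins)}
total_losses = {name: sum(row[j] for row in wins)
                for j, name in enumerate(learners)}


def ties(first, second):
    """Data sets on which neither learner significantly beats the other."""
    i, j = learners.index(first), learners.index(second)
    return N_DATASETS - wins[i][j] - wins[j][i]
```

Here `ties("J4.8", "QUEST")` gives the 3 data sets with equal performance mentioned
above.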

The median linear order for misclassification reported in Table 3 suggests that
tuning of J4.8 instead of using the heuristic approach is worth the effort. A
similar conclusion can be made for the RPart versions. Here, the median linear order
suggests that the common one-standard-error rule performs worse. For both cases, the
underlying dominance relation figures of Table 2 deserve attention. Regarding the
first case, J4.8(cv) dominates J4.8 in only four of the six data settings in which a
significant test decision for performance differences could be made. In addition,
the remaining 12 data settings yield equivalent performances. Therefore the
superiority of J4.8(cv) over J4.8 is questionable. In contrast, the superiority of
RPart0 over RPart1 seems to be more reliable, but still the number of data settings
producing tied results is high. A comparison of the figures of CTree and the RPart
versions confirms previous findings (Hothorn et al., 2006) that CTree and RPart
often perform equally well. The question concerning the dominance relation between
J4.8 and RPart cannot be answered easily: Overall, the median linear order suggests
that the J4.8 decision tree versions are superior to the RPart tree learners in
terms of predictive performance. But still, looking at the underlying relations of
the best performing versions of both algorithms (J4.8(cv) and RPart0) reveals that a
confident decision concerning predictive superiority cannot be made. J4.8(cv) wins
on 8 data settings and RPart0 on 6, a net difference of only two in favor of
J4.8(cv), and no significant differences are reported on the remaining four data
settings. A brief look at the complexity ranking (Table 3) and the underlying
complexity dominance relations (Table 2, bottom) shows that J4.8 and its tuned
version produce more complex trees than the RPart algorithms. While analogous
analyses comparing the J4.8 versions to CTree do not indicate confident predictive
performance differences, the superiority of the J4.8 versions over QUEST in terms of
predictive accuracy is evident.

Table 2. Summary of predictive performance dominance relations across all 18
datasets based on misclassification rates and model complexity (columns refer to
losers, rows are winners).

Misclassification   J4.8  J4.8(cv)  RPart0  RPart1  QUEST  CTree    Σ
J4.8                   0         2       9       9     11      8   39
J4.8(cv)               4         0       8       9     11      9   41
RPart0                 5         6       0       7     10      7   35
RPart1                 6         4       1       0      8      6   25
QUEST                  4         2       2       5      0      7   20
CTree                  7         6       7       8      9      0   37
Σ                     26        20      27      38     49     37

Complexity          J4.8  J4.8(cv)  RPart0  RPart1  QUEST  CTree    Σ
J4.8                   0         1       0       0      2      0    3
J4.8(cv)              17         0       0       0      5      3   25
RPart0                18        18       0       0     13     15   64
RPart1                18        18      16       0     14     15   81
QUEST                 15        13       5       4      0     10   47
CTree                 18        14       3       2      8      0   45
Σ                     86        64      24       6     42     43

Table 3. Median linear order consensus rankings for algorithm performance

Rank   Misclassification   Complexity
1      J4.8(cv)            RPart1
2      J4.8                RPart0
3      RPart0              QUEST
4      CTree               CTree
5      RPart1              J4.8(cv)
6      QUEST               J4.8



[Figure 3: medians of the cross-validated J4.8(cv) parameters for each data set;
horizontal axis: median confidence threshold (C), ranging from 0.0 to 0.4.]

Fig. 3. Medians of the J4.8(cv) tuning parameter distributions for C and M



To aggregate the tuning results from J4.8(cv), Figure 3 depicts the median C 
and M parameters chosen for each of the 18 parameter distributions. It confirms 
the finding from the individual breast cancer and Pima Indians diabetes results (see 
Figure 2) that the parameter chosen by cross-validation can be far off the default 
values for C and M. 

4 Discussion and further work 

In this paper, we present results of a medium scale benchmark experiment with a 
focus on popular open-source tree-based learners available in R. With respect to 
our two main objectives - performance differences between C4.5 and CART, and 
heuristic choice of hyper-parameters - we can conclude: (1) The fully cross-validated
J4.8(cv) and RPart0 perform better than their heuristic counterparts J4.8 (with fixed
hyper-parameters) and RPart1 (employing a 1-standard-error rule). (2) In terms of
predictive performance, no support for the claims of (clear) superiority of either
algorithm can be found: J4.8(cv) and RPart0 lead to similar misclassification results;
however, J4.8(cv) tends to grow larger trees. Overall, this suggests that many beliefs
or preconceptions about the classical tree algorithms should be (re-)assessed using
benchmark studies. Our contribution is only a first step in this direction, and further
steps will require a larger study with additional datasets and learning algorithms.

Abstract. Spelling correction is the task of correcting words in texts. Most of the available
spelling correction tools only work on isolated words and compute a list of spelling sugges- 
tions ranked by edit-distance, letter-n-gram similarity or comparable measures. Although the 
probability of the best ranked suggestion being correct in the current context is high, user 
intervention is usually necessary to choose the most appropriate suggestion (Kukich, 1992). 

Based on preliminary work by Sabsch (2006), we developed an efficient context sensi- 
tive spelling correction system dcClean by combining two approaches: the edit distance based 
ranking of an open source spelling corrector and neighbour co-occurrence statistics computed 
from a domain specific corpus. In combination with domain specific replacement and abbre- 
viation lists we are able to significantly improve the correction precision compared to edit 
distance or context based spelling correctors applied on their own. 



1 Introduction 

In this paper we present a domain specific and context-based text preparation component
for processing noisy documents in automobile quality management. More
precisely, the task is to clean texts coming from vehicle workshops, typed in at
front desks by service advisors and expanded by technicians. Those texts are always
written under time pressure, using as few words as possible and only as many
as required to describe an issue. Consequently this kind of textual data is extremely
noisy, containing common language, technical terms, lots of abbreviations,
codes, numbers and misspellings. More than 10% of all terms are unknown with
respect to commonly used dictionaries.

In the literature basically two approaches are discussed to handle text cleaning:
dictionary based algorithms like Aspell work in a semi-automatic way by presenting
suggestions for unknown words, i.e. words which are not in a given dictionary. For each
of those words a list of possible corrections is returned and the user has to check the
context to choose the appropriate one. This is applicable when supporting users in creating
single documents. But for automatically processing large amounts of textual data this
is impossible.

Context based spelling correction systems like WinSpell or IBM's csSpell only
make use of context information. The algorithms are trained on a certain text corpus
to learn probabilities of word occurrences. When analysing a new document for misspellings
every word is considered suspicious. A context dependent probability
for its appearance is calculated and the most likely word is applied. A more
detailed introduction to context-based spelling correction can be found in Golding
and Roth (1995), Golding and Roth (1999) and Al-Mubaid and Trumper (2001).

In contrast to existing work dealing either with context information or with dictionaries,
in our work we combine both approaches to increase efficiency as well as to
assure high accuracy. Obviously, text cleaning is a lot more than just spelling correction.
Therefore we add technical dictionaries, abbreviation lists, language recognition
techniques and word splitting and merging capabilities to our approach.



2 Linguistics and context sensitivity 

This section gives a brief introduction to the underlying correction techniques. We
distinguish two main approaches: linguistic and context based algorithms. Phonetic
codes are a linguistic approach to pronunciation and are usually based on initial ideas
of Russell and Odell. They developed the Soundex code, which maps words with similar
pronunciations by assigning numbers to certain groups of letters. Together with
the first letter of the word, the first three digits form the Soundex code, where the
same code means the same pronunciation. Current developments are the Metaphone and
the improved Double Metaphone (Phillips, 2000) algorithm by Lawrence Philips; the
latter is currently used by the ASpell algorithm. Edit distance based algorithms
are a second member of the linguistic methods. They calculate word distances by counting
letter-wise transformation steps to change one word into another. One of the best
known members is the Levenshtein algorithm, which uses the three basic operations
replacement, insertion and deletion to calculate a distance score for two words.
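The Levenshtein distance described above can be illustrated with a short sketch. This is a textbook dynamic-programming formulation written for illustration; it is not taken from ASpell or Jazzy, and the function name `levenshtein` is our own choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-letter replacements, insertions and
    deletions needed to turn word a into word b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty word
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # replace ca by cb (or keep it)
            ))
        previous = current
    return previous[-1]


distance = levenshtein("brake", "break")
```

Each of the three operations contributes a cost of one, so words that differ only by a swapped letter pair, such as "brake" and "break", end up at distance two.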

Context based correction methods usually take advantage of two statistical measures:
word frequencies and co-occurrences. The word frequency $f(w)$ for a word $w$
counts the frequency of its appearance within a given corpus. This is done for every
unique word (or token) using the raw text corpus. The result is a list of unique
words which can be ordered and normalised with the total number of words. With
co-occurrences we refer to a pair of words which commonly appear in similar contexts.
Assuming statistical independence between the occurrence of two words $w_1$
and $w_2$, the estimated probability $P_E(w_1 w_2)$ of them occurring in the same context
can be easily calculated with:

$$P_E(w_1 w_2) = P(w_1)\,P(w_2) \qquad (1)$$

If the common appearance of two words is significantly higher than expected they
can be regarded as co-occurrent. Given a corpus and context related word frequencies
the co-occurrence measure can be calculated using pointwise mutual information or
likelihood ratios. We utilize pointwise mutual information with a minimum threshold
filtering.

$$\mathrm{CoOc}(w_1, w_2) = \log \frac{P(w_1 w_2)}{P(w_1)\,P(w_2)} \qquad (2)$$



Depending on the size of a given context, one can distinguish between three kinds
of co-occurrences. Sentence co-occurrences are pairs of words which appear significantly
often together within a sentence, neighbour co-occurrences occur significantly
often side by side and window-based co-occurrences are normally calculated within
a fixed window size. A more detailed introduction to co-occurrences can be found
in Manning and Schütze (1999) and Heyer et al. (2006).
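To make the pointwise mutual information of equation (2) concrete, the following sketch computes neighbour co-occurrences from a tiny tokenised corpus. It is an illustration under simplifying assumptions: probabilities are plain relative frequencies, the function name `neighbour_pmi` and the parameter `min_pmi` (the minimum threshold) are hypothetical, and the three example sentences are invented.

```python
import math
from collections import Counter


def neighbour_pmi(sentences, min_pmi=0.0):
    """Return {(left, right): PMI} for directly adjacent word pairs
    whose PMI reaches the minimum threshold."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))  # side-by-side pairs
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (w1, w2), count in pair_counts.items():
        p_pair = count / n_pairs
        # expected probability under independence, as in equation (1)
        p_expected = (word_counts[w1] / n_words) * (word_counts[w2] / n_words)
        pmi = math.log(p_pair / p_expected)
        if pmi >= min_pmi:
            scores[(w1, w2)] = pmi
    return scores


corpus = [
    ["left", "rear", "light", "broken"],
    ["right", "rear", "light", "flickers"],
    ["left", "front", "door", "rattles"],
]
scores = neighbour_pmi(corpus)
```

The pairs are kept in order (left word, right word), so the same table can later be queried separately for the left and the right neighbour of a word.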



3 Framework for text preparation 

Before presenting our approach we will discuss requirements which we identified as 
being important for text preparation. They are based on general considerations but 
contain some application specific characteristics. In the second part we will explain 
our framework for a joint text preparation. 

3.1 General considerations 

Because we focus on automatically processing large amounts of textual documents
we have to ensure fast processing and minimal human interaction. Therefore, after
a configuration process the system must be able to clean texts autonomously. The
correction error has to be minimized, but in contrast to evaluations found in the
literature we propose a very conservative correction strategy. If there is an unknown
word it will only be corrected when certain thresholds are reached. As one can see
in the evaluation, we would rather accept a loss in recall than insert an incorrect word.
To detect words which have to be corrected, we rely on dictionaries. If a word cannot
be found in (general and custom prepared) dictionaries we regard it as suspicious.
This is different from pure context based approaches; for instance, Golding and Roth
(1995) consider every word as suspicious. But this leads to two problems: first, the
calculation is computationally complex, and second, we introduce a new error probability
of changing a correct word to an incorrect one. Even if this probability is below 0.1%,
for 100,000 terms it would result in up to 100 misleading words.

Our proposed correction strategy can be seen as an interaction between linguistic 
and context based approaches, extended by word frequency and manually created 
replacement lists. The latter ones are essential to expand abbreviations, harmonize 
synonyms, and speed-up replacements of very common spelling errors. 

3.2 Cleaning workflow 

Automatic text cleaning covers several steps which depend on each other and should
be processed in sequence. We developed a sequential workflow, consisting
of independent modules, which are plugged into IBM's Unstructured Information
Management Architecture (UIMA) framework. This framework allows one to implement
analysis modules for unstructured information like text documents and to plug
them into a run-time environment for execution. The UIMA framework takes care
of resource management, encapsulation of document data and analysis results, and
even distribution of algorithms on different machines. For processing documents, the
UIMA framework uses annotations: the original text content is never changed, and all
correction tasks are recorded as annotations. The dcClean workflow (figure 1) consists
of several custom-developed UIMA modules which we will explain in detail
according to their sequence.

Fig. 1. dcClean workflow

Tokenizer: The tokenizer is the first module, splitting the text into single tokens. 
It can recognise regular words, different number patterns (such as dates, mileages,
money values), domain dependent codes and abbreviations. The challenge is to han- 
dle non-letter characters like slashes or dots. For example a slash between two tokens 
can indicate a sentence border, an abbreviation (e.g. "r/r" = "right rear") or a number 
pattern. To handle all different kinds of tokens we parametrize the tokenizer with a 
set of regular expressions. 
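A tokenizer parametrized by regular expressions could look like the following sketch. The pattern names and the expressions themselves are hypothetical examples chosen for illustration; they are not the patterns used in dcClean, and the sample sentence is invented.

```python
import re

# Ordered list of (token type, pattern); earlier entries win over later ones.
# All patterns are illustrative assumptions, not the dcClean configuration.
TOKEN_PATTERNS = [
    ("DATE", r"\d{1,2}[./]\d{1,2}[./]\d{2,4}"),
    ("MILEAGE", r"\d+(?:[.,]\d{3})*\s?(?:km|mi)\b"),
    ("ABBREV", r"[A-Za-z]/[A-Za-z]\b"),        # e.g. "r/r" = "right rear"
    ("NUMBER", r"\d+(?:[.,]\d+)?"),
    ("WORD", r"[A-Za-z]+"),
    ("PUNCT", r"[^\sA-Za-z0-9]"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS)
)


def tokenize(text):
    """Split text into (token type, token text) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


tokens = tokenize("cust states r/r door rattles at 45,000 km since 12/03/2007.")
```

Because the alternatives are tried in order, the slash in "r/r" is consumed as part of an abbreviation and the slashes in the date as part of a date, while a remaining isolated slash would fall through to the punctuation pattern.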

Language recognition: Because all of the following modules use resources 
which might be in different languages, the text language has to be recognised be- 
fore further processing. Therefore we included the LanI library from the University 
of Leipzig. Based on characteristic word frequency lists for different languages the 
library determines the language for a given document by comparing the tokens to 
the lists. Because we process domain specific documents, the statistical properties 
are different in comparison to regular language specific data. Thus it is important to 
include adequate term frequency lists. 

Replacements: This module handles all kinds of replacements, because we
deal with a lot of general and custom abbreviations which a pure spelling correction
cannot handle. We manually created replacement lists $R$ for abbreviations, synonyms,
certain multi-word terms and misspellings which are frequent but which spelling
correction algorithms fail to correct properly. The replacement module works
on word tokens, uses language dependent resources and incorporates context information.
This is very important for the replacement of ambiguous abbreviations. If
the module finds, for example, the word "lt", this can mean both "light" and "left".
To handle this we look up the co-occurrence levels of each possible replacement $w$ as
left neighbour of the succeeding word, $\mathrm{CoOc}(w, w_{right})$, and as right neighbour of the
preceding word, $\mathrm{CoOc}(w, w_{left})$. The result is the replacement $w'$ with the maximum
sum of co-occurrence levels to the left neighbour word $w_{left}$ and to the right neighbour
word $w_{right}$.

$$\mathrm{CoOcSum}(w) = \mathrm{CoOc}(w, w_{left}) + \mathrm{CoOc}(w, w_{right}) \qquad (3)$$

$$w' = \arg\max_{w} \mathrm{CoOcSum}(w) \qquad (4)$$
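Equations (3) and (4) amount to a small lookup-and-maximise step, sketched below. The co-occurrence table, its values and the function name `best_replacement` are invented for illustration; the table is assumed to be keyed by ordered neighbour pairs (left word, right word), and missing pairs are assumed to count as zero.

```python
def best_replacement(candidates, left, right, cooc):
    """Pick the candidate with the highest summed co-occurrence level
    to the preceding word (left) and the succeeding word (right)."""
    def cooc_sum(w):
        # equation (3): level with the left neighbour plus level with the right one
        return cooc.get((left, w), 0.0) + cooc.get((w, right), 0.0)
    # equation (4): candidate with the maximum sum
    return max(candidates, key=cooc_sum)


# Hypothetical neighbour co-occurrence levels
cooc = {("rear", "light"): 2.1, ("light", "bulb"): 1.7, ("left", "door"): 1.9}

# "rear lt bulb": the context favours "light"
choice = best_replacement(["light", "left"], left="rear", right="bulb", cooc=cooc)
```

With the context "the lt door" the same table would select "left" instead, since only the pair ("left", "door") has a non-zero level.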

Merging: To merge words which were split by an unintended blank character,
this module sequentially checks two successive words for correct spelling. If one or
both words are not contained in the dictionary but the joint word is, the two words
get annotated as a joint representation.
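The merging rule can be sketched in a few lines. Note one deliberate simplification: dcClean records corrections as UIMA annotations and leaves the text unchanged, whereas this illustrative function simply returns a new token list. The function name and the toy dictionary are assumptions.

```python
def merge_split_words(tokens, dictionary):
    """Join two successive tokens when at least one of them is unknown
    but their concatenation is a dictionary word."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            first, second = tokens[i], tokens[i + 1]
            joint = first + second
            one_unknown = first not in dictionary or second not in dictionary
            if one_unknown and joint in dictionary:
                merged.append(joint)
                i += 2  # both halves are consumed
                continue
        merged.append(tokens[i])
        i += 1
    return merged


dictionary = {"the", "wind", "windshield", "is", "cracked"}
result = merge_split_words(["the", "winds", "hield", "is", "cracked"], dictionary)
```

Two correctly spelled neighbours are never merged, which keeps the step conservative in the sense of section 3.1.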

Splitting and spelling correction: The last module of our workflow treats
spelling errors and word splittings. To correct words which are not contained in the
dictionary, dcClean uses the Java based ASpell implementation Jazzy and incorporates
word co-occurrences, word frequencies (both were calculated using a reference
corpus), a custom developed weighting schema and a splitting component. If the
module finds an unknown word $w_m$, it passes it to the ASpell component. The ASpell
algorithm creates suggestions with the use of phonetic codes (Double Metaphone
algorithm) and edit distances (Levenshtein distance). To this end the algorithm creates
a set $S'_{ASpell}$ containing all words $w$ of the dictionary $D$ which have the same phonetic
code as the potential misspelling $w_m$ or as one of its variants $v \in V$ with an edit distance
of one.

Fig. 2. Spelling correction example

$$V = \{ v \mid \mathrm{Edit}(v, w_m) \le \theta_{edit1} \} \qquad (5)$$

$$S'_{ASpell} = \{ w \in D \mid \mathrm{Phon}(w) = \mathrm{Phon}(w_m) \lor \exists v \in V : \mathrm{Phon}(v) = \mathrm{Phon}(w) \} \qquad (6)$$

Then the set is filtered according to the edit based threshold $\theta_{edit2}$:

$$S_{ASpell} = \{ w \mid w \in S'_{ASpell} \land \mathrm{Edit}(w, w_m) < \theta_{edit2} \} \qquad (7)$$

A context based set of suggestions $S_{cooc}$ is generated using co-occurrence statistics.
For this we use a similar technique as for the replacements: this time we look
up all words $w$ that co-occur as left neighbour of the succeeding word, $\mathrm{CoOc}(w, w_{right})$,
and as right neighbour of the preceding word, $\mathrm{CoOc}(w, w_{left})$, of the misspelling.

$$S'_{cooc} = \{ w \mid \mathrm{CoOc}(w, w_{left}) > \theta_{cooc} \lor \mathrm{CoOc}(w, w_{right}) > \theta_{cooc} \} \qquad (8)$$

The co-occurrence levels are summed up and filtered by the Levenshtein distance measure
to ensure a certain word based similarity.

$$S_{cooc} = \{ w \mid w \in S'_{cooc} \land \mathrm{Edit}(w, w_m) < \theta_{edit3} \} \qquad (9)$$



The third set of suggestions $S_{split}$ is created using a splitting algorithm. This algorithm
provides the capability to split words which are unintentionally written as one
word. The splitting algorithm creates a set of suggestions $S_{split}$ containing
all possible splittings of the word $w$ into two parts $s_1$ and $s_2$ with $s_1 \in D$ and $s_2 \in D$. To
select the best matching correction $\hat{w}$, the three sets of suggestions $S_{ASpell}$, $S_{cooc}$ and
$S_{split}$ are joined

$$S = S_{ASpell} \cup S_{cooc} \cup S_{split} \qquad (10)$$

and weighted according to their co-occurrence statistics, or - if there are no significant
co-occurrents - according to their frequencies. For weighting the splitting suggestions
we use the average frequencies or co-occurrence measures of both words.
The correction $\hat{w}$ is the element with the maximum weight.

$$\mathrm{Weight}(w) = \begin{cases} \mathrm{CoOc}(w), & \text{if } \exists w' \in S : \mathrm{CoOc}(w') > \theta_{cooc}, \\ f(w), & \text{else} \end{cases} \qquad (11)$$

$$\hat{w} = \arg\max_{w \in S} \mathrm{Weight}(w) \qquad (12)$$
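The splitting step and the selection rule of equations (11) and (12) can be sketched as follows. This is an illustration only: the function names, the co-occurrence levels, the frequencies and the threshold value are invented, each suggestion is assumed to come with a single precomputed co-occurrence level, and the averaging used for weighting split suggestions is left out for brevity.

```python
def split_suggestions(word, dictionary):
    """All ways to split word into two parts that are both dictionary words."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in dictionary and word[i:] in dictionary
    ]


def choose_correction(suggestions, cooc_level, frequency, theta_cooc):
    """Equations (11)/(12): weight by co-occurrence level if any suggestion
    has a significant level, otherwise by word frequency; return the maximum."""
    use_cooc = any(cooc_level.get(w, 0.0) > theta_cooc for w in suggestions)
    weight = cooc_level if use_cooc else frequency
    return max(suggestions, key=lambda w: weight.get(w, 0.0))


dictionary = {"brake", "brakes", "pedal"}
splits = split_suggestions("brakepedal", dictionary)

suggestions = ["brake", "break", "brace"]
frequency = {"brake": 120, "break": 300, "brace": 15}        # hypothetical counts
with_context = choose_correction(suggestions, {"brake": 2.3, "break": 0.4},
                                 frequency, theta_cooc=1.0)
without_context = choose_correction(suggestions, {}, frequency, theta_cooc=1.0)
```

The example shows why the context matters: without significant co-occurrents the more frequent word "break" wins, while a significant co-occurrence level selects "brake".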



4 Experimental results 

To evaluate the pure spelling correction component of our framework we just con- 
sider error types which other spellcheckers can handle as well, which excludes merg- 
ing, splitting or replacing words. We use a training corpus of one million domain 
specific documents (500MB of textual data) to calculate word frequencies and co- 
occurrence statistics. The evaluation is performed on a test set consisting of 679 
misspelled terms including their context. 

We compared dcClean with the dictionary based Jazzy spellchecker, which is based on the
ASpell algorithm, and IBM's context based spellchecker csSpell. To get comparable
results we set the Levenshtein distance threshold for dcClean and Jazzy to the same
value; the confidence threshold for csSpell is set to 10% (this threshold is based
on former experiments). During the evaluation we counted words which were corrected
accurately, words which were corrected but where the correction was mistaken, and
words which were not changed at all. (As explained in section 3.1, changes of correct
words to incorrect ones are not considered, because this can be avoided by the use of
dictionaries.)

As can be seen in figure 3, dcClean outperforms both spellcheckers using either
dictionaries or context statistics. The improvement in relation to Jazzy is due to the
fact that the ASpell algorithm just returns a set of suggestions. For this evaluation
we always chose the best suggestion. But sometimes there are several similarly ranked
suggestions, and in a certain number of cases the best result is not the one that fits
the context. However, when solely choosing corrections from the context as done
by csSpell, even a very low confidence threshold of 10% means that most
words are not changed at all. A reason for this are our domain specific documents,
where misspellings are sometimes as frequent as the correct word, or words are
continuously misspelled.

Fig. 3. Spelling correction (categories: correct, incorrect, no change)

Fig. 4. Entire text cleaning workflow (categories: correct, incorrect, no change)

To explain the need for a custom text cleaning workflow we show the improve- 
ments of our dcClean framework in comparison with a pure dictionary based spelling 
correction. Therefore we set up Jazzy with three different configurations: (1) using 
a regular English dictionary, (2) using our custom prepared dictionary and (3) using 
our custom prepared dictionary, splittings and replacements. The test set contains 200 
documents with 769 incorrect tokens and 9137 tokens altogether. Figure 4 shows that 
Jazzy with a regular dictionary performs very poorly. Even with our custom dictio- 
nary there are only slight improvements. The inclusion of replacement lists leads to 
the biggest enhancements. This can be explained by the amount of abbreviations and 
technical terms used in our data. But dcClean with its context sensitivity outperforms 
even this. 



5 Conclusion and future work 

In this paper we explained how to establish an entire text cleaning process. We 
showed how the combination of linguistic and statistical approaches improves not
only spelling correction but also the entire cleaning task. The experiments illustrated 
that our spelling correction component outperformed dictionary or context based 
approaches and that the whole cleaning workflow performs better than using only 
spelling correction. 

In future work we will try to utilise dcClean for more languages than English, 
which might be especially difficult for languages like German, that have many com- 
pound words and a rich morphology. We will also extend our spelling correction 
component to handle combined spelling errors, like two words which are misspelled 
and accidentally written as one word. Another important part of our future work will 
be a more profound analysis of the suggestion weighting algorithm. A combination 
of frequency, co-occurrence level and Levenshtein distance may allow further improvements.

Abstract. In industrial practice, quality management for manufacturing processes is often
based on process capability indices (PCI) like $C_p$, $C_{pm}$, $C_{pk}$ and $C_{pmk}$. These indices measure
the behavior of a process incorporating its statistical variability and location and provide a
unitless quality measure. Unfortunately, PCIs are not able to identify the factors having the
major impact on quality, as they are only based on measurement results and do not consider
the explaining process parameters. In this paper an Operational Research approach based
on Branch and Bound is derived, which combines both the numerical measurements and the
nominal process factors. This combined approach makes it possible to identify the main source
of inferior or superior quality of a manufacturing process.



1 Introduction 

The quality of a manufacturing process can be seen as the ability to manufacture a
certain product within its specification limits $U$, $L$ and as close as possible to its target
value $T$, describing the point where its quality is optimal. In literature, numerous
process capability indices have been proposed in order to provide unitless quality
measures to determine the performance of a process, relating the preset specification
limits to the actual behavior (Kotz and Johnson (2002)). This behavior can be described
by the process variation and process location. In order to state future quality
of a manufacturing process based on the past performance, the process is supposed
to be stable or in control. This means that both process mean and process variation
have to lie, in the long run, between pre-defined limits. A common technique
to monitor this are control charts, one of the tools provided by Statistical Process
Control.

The basic idea for the most common indices is to assume that the considered
manufacturing process follows a normal distribution and that the distance between the
upper and lower specification limit should equal $6\sigma$. The commonly recognized "basic"
PCIs $C_p$, $C_{pm}$, $C_{pk}$ and $C_{pmk}$ can be summarized by a superstructure, which was
introduced in Vännman (1995) and is referred to in literature as $C_p(u,v)$:

$$C_p(u,v) = \frac{d - u\,|\mu - M|}{3\sqrt{\sigma^2 + v\,(\mu - T)^2}} \qquad (1)$$

where $\sigma$ is the process standard deviation, $\mu$ the process mean, $d = (U - L)/2$ the tolerance
width, $M = (U + L)/2$ the mid-point between the two specification limits and $T$
the target value. The "basic" PCIs can be obtained by setting $u$ and $v$ to:

$$C_p = C_p(0,0); \quad C_{pk} = C_p(1,0); \quad C_{pm} = C_p(0,1); \quad C_{pmk} = C_p(1,1) \qquad (2)$$

Estimators for the indices can be obtained by substituting $\mu$ by the sample mean
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\sigma^2$ by the sample variance
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$. They provide
stable and reliable point estimators for processes following a normal distribution.
However, in practice, this requirement is hardly met, thus the basic PCIs as defined
in (1) are not appropriate for processes with non-normal distributions. What is really
needed are indices which do not depend on any kind of distribution in order to be
useful for measuring the quality of a process.

In Pearn and Chen (1997) a generalization of the PCI superstructure (1) is introduced
in order to cover those cases where the underlying data does not follow a
Gaussian distribution. The authors replaced the process standard deviation $\sigma$ by the
99.865 and 0.135 quantiles of the empirical distribution function and $\mu$ by the median $m$
of the process:

$$C'_p(u,v) = \frac{d - u\,|m - M|}{3\sqrt{\left[\frac{F_{99.865} - F_{0.135}}{6}\right]^2 + v\,(m - T)^2}} \qquad (3)$$

The idea behind this substitution is that the difference between the
quantiles $F_{99.865}$ and $F_{0.135}$ again equals $6\sigma$, or $C'_p(u,v) = 1$, assuming the special
case that the process follows a Gaussian distribution. The special PCIs $C'_p$, $C'_{pm}$, $C'_{pk}$
and $C'_{pmk}$ can be obtained by applying $u$ and $v$ as in (2).
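For illustration, the parametric superstructure (1) with the plug-in estimators (sample mean and sample standard deviation) can be computed as in the sketch below. The function name `cp_uv` and the measurement values are invented; the quantile-based variant (3) is not covered here.

```python
import math
import statistics


def cp_uv(sample, lower, upper, target, u, v):
    """Plug-in estimate of C_p(u, v) from equation (1)."""
    mu = statistics.mean(sample)
    sigma = statistics.stdev(sample)   # square root of the (n - 1) sample variance
    d = (upper - lower) / 2            # tolerance width
    mid = (upper + lower) / 2          # mid-point M of the specification limits
    return (d - u * abs(mu - mid)) / (3 * math.sqrt(sigma**2 + v * (mu - target)**2))


sample = [98, 100, 102, 99, 101]       # hypothetical measurements
L, U, T = 90, 110, 101                 # hypothetical limits and off-centre target

cp = cp_uv(sample, L, U, T, u=0, v=0)
cpk = cp_uv(sample, L, U, T, u=1, v=0)
cpm = cp_uv(sample, L, U, T, u=0, v=1)
cpmk = cp_uv(sample, L, U, T, u=1, v=1)
```

In this example the sample mean coincides with the mid-point of the limits, so the estimates of $C_p$ and $C_{pk}$ agree, while the off-centre target lowers $C_{pm}$ and $C_{pmk}$.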

Assuming that the following assumptions hold, a class of non-parametric indices
and a particular specimen thereof can be introduced: every manufacturing process is
defined by two distinct sets. Let $Y$ be the set of influence variables (process parameters
or process factors) and $X$ the corresponding goal variables or measurement
results, then a class of process indices can be defined as:

Definition 1. Let $X$ and $Y$ describe a manufacturing process. Furthermore, let $f(x,y)$
be the empirical density of the underlying process and $w(x)$ a kernel function. Then

$$Q_w = \frac{\int_X \int_Y w(x)\, f(x,y)\, dy\, dx}{\int_X \int_Y f(x,y)\, dy\, dx}$$

defines a class of empirical process indices.

Obviously, if $w(x) = x$ or $w(x) = x^2$ we obtain the first and the second moment
of the process, respectively, as $\int_X \int_Y f(x,y)\, dy\, dx = 1$. But, to measure the quality of a process,
we are interested in the relationship between the designed specification limits and the process
behavior. A possibility is to choose the kernel function $w(x)$ in such a way that it
becomes a function of the designed limits $U$ and $L$.

Definition 2. Let $X$, $Y$ and $f(x,y)$ be defined as in Definition 1. Let $U$, $L$ be specification
limits. The Empirical Capability Index ($E_{ci}$) is defined as:

$$E_{ci} = \frac{\int_X \int_Y \mathbb{1}_{(L \le x \le U)}\, f(x,y)\, dy\, dx}{\int_X \int_Y f(x,y)\, dy\, dx}$$

The $E_{ci}$ measures the percentage of data points which lie between the specification
limits $U$ and $L$. Therefore, it is more sensitive to outliers compared to the common
non-parametric indices. A disadvantage is that for processes having all data points
within the specification limits, the $E_{ci}$ always equals one and therefore does not
provide a comparable quality measure. To avoid this, the specification limits $U$ and
$L$ have to be modified, in order to get "further into the sample", by linking them to
the behavior of the process.
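On a finite sample, the Empirical Capability Index reduces to the share of measurements inside the specification limits. The following minimal sketch shows this reading; the function name and the data are invented for illustration.

```python
def empirical_capability(measurements, lower, upper):
    """Share of measurements lying within [lower, upper]."""
    inside = sum(1 for x in measurements if lower <= x <= upper)
    return inside / len(measurements)


measurements = [9.5, 10.1, 10.4, 11.2, 8.7]   # hypothetical measurement results
e_ci = empirical_capability(measurements, lower=9.0, upper=11.0)
```

With tighter limits, for example 9.8 to 10.2, the same sample yields a lower value, which is the effect exploited when the limits are moved "further into the sample".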

However, after measuring the quality of a process, one might be interested in whether there
are subsets of influence variable values such that the quality of a process becomes
better if the process is constrained to these parameters only. In the following section a
non-parametric, numerical approach for identifying those parameters is derived and an
algorithm which efficiently solves this problem is presented.



2 Root Cause Analysis 

In literature a common technique to identify significant discrete parameters having
an impact on numeric variables like measurement results is the Analysis of Variance
(ANOVA). As a limiting factor, techniques of Variance Analysis are only useful
if the problem is of low dimension. Additionally, the variables should be well
balanced or have a simple structure. Another constraint is the assumption that the
analyzed data follows a multivariate Gaussian distribution. In most applications
these requirements are hardly ever met. The distribution of the parameters describing
the measured variable is in general not Gaussian and of higher dimension. Also, the
combinations of the cross product of the parameters are non-uniformly and sparsely
populated and do not have a simple dependence structure. Therefore, the method of Variance
Analysis is not applicable. What is really needed is a more general approach to
identify the variables responsible for inferior or superior quality.

A process can be defined as a set of influence variables (i.e. process parameters)
$Y = (Y^1, \dots, Y^n)$ consisting of values $Y^i = y^i_1, \dots, y^i_{m_i}$ and a set of corresponding goal
variables (i.e. measurement results) $X = (X_1, \dots, X_n)$. If constraining the influence
variable values to a subset $\tilde{Y} \subset Y$, $\tilde{Y}$ defines a sub-process of the original process $Y$.
The support of a sub-process $\tilde{Y}$ can be written as $N(X|\tilde{Y}) := \int_X \int_{\tilde{Y}} f(x,y)\, dy\, dx$ and
consequently, a conditional PCI is defined as $Q(X|\tilde{Y})$. Any of the indices defined
in the previous section can be used, whereby the value of the respective index is
calculated on the conditional subset $\tilde{X} \subset X$.

In order to determine those parameters having the greatest impact on quality, an
optimal sub-process, consisting of optimal influence combinations, has to be identified.
A naive approach would be to maximize $Q(X|\tilde{Y})$ over all sub-processes $\tilde{Y}$ of $Y$.
Unfortunately, in general this yields a sub-process which would only have a limited
support ($N(X|\tilde{Y}^*) \ll n$). A better approach is to think in economic terms and to
weight the factors responsible for inferior quality, which we want to remove, by the
costs of eliminating them. In practice this is not feasible, as tracking the actual costs is
too expensive. But it is likely that rare factors which are responsible for lower quality
are "cheaper" to remove than frequent influences. In other words, sub-processes
with high support are preferable.

Often the available sample set for process optimization is small, having numerous
influence variables but only few measurement results. By limiting ourselves only
to combinations of variables, we might get sub-processes that are too small (having low
support). Therefore, we extend the possible solutions to combinations of variables and
their values - the search space for optimal sub-processes is spanned by the power set
of the influence parameters $\mathcal{P}(Y)$. The two-sided problem of finding the parameter
set that combines on the one hand an optimal quality measure and on the other hand a
maximal support can be summarized by the following optimization problem:

Definition 3.

$$(P) = \begin{cases} N(X|\tilde{Y}) \to \max \\ Q(X|\tilde{Y}) \ge q_{min} \\ \tilde{Y} \in \mathcal{P}(Y) \end{cases}$$

The solution of the optimization problem is the subset of process parameters with
maximal support among those processes having a better quality than the given threshold
$q_{min}$. Often, $q_{min}$ is set to the common values for process capability of 1.33 or
1.67.

Due to the nature of the application domain, the investigated parameters are discrete,
which inhibits an analytical solution but allows the use of Branch and Bound
techniques. In the following we derive an algorithm which solves the optimization
problem (3) by avoiding the evaluation of the exponential number of possible combinations
spanned by the cross product of the influence parameters. In order to achieve
this, an efficient cutting rule is derived in the next section.

Branch and bound algorithm 

To efficiently store and access the necessary information and to apply Branch and
Bound techniques, a multitree was chosen as the representing data structure. Each node
of the multitree represents a possible combination of the influence parameters (sub-process)
and is built from the combination of the parent's influence set and a new
influence variable and its value(s). Fig. 1 depicts the data structure, whereby each
node stands for all elements of the power set of the considered variable.

To find the optimal solution to the optimization problem (3), a depth-first search
is applied to traverse the tree using a Branch and Bound principle. The idea to branch
and bound the traversal of the tree is based on the following thoughts: by descending a
branch of the tree, the number of constraints increases, as new influence variables
are added, and therefore the sub-process support decreases (compare Fig. 1). Thus,
if a node has a support lower than a given minimum support, there is no possibility
to find a node (sub-process) with a higher support in the branch below. This reduces
the time to find the optimal solution significantly, as a good portion of the tree
can be omitted from the traversal.



Algorithm 1 Branch & Bound algorithm for process optimization
procedure TraverseTree($\tilde{Y}$)
    $\mathcal{Y}$ := {sub-nodes of $\tilde{Y}$}
    for all $y \in \mathcal{Y}$ do
        if $N(X|y) > N_{max}$ and $Q(X|y) \ge q_{min}$ then
            $N_{max} = N(X|y)$
        end if
        if $N(X|y) > N_{max}$ and $Q(X|y) < q_{min}$ then
            TraverseTree($y$)
        end if
    end for
end procedure
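The traversal of Algorithm 1 can be sketched in Python as follows. Several simplifications are assumptions of this sketch: records are plain dictionaries, a sub-process constrains each influence variable to a single value (not to a set of values), the support is the number of matching records, and the quality index is passed in as a function. All names and the example data are invented.

```python
def branch_and_bound(records, variables, quality, q_min):
    """Return the constraint set with maximal support whose quality
    reaches q_min, found by a depth-first search with a support bound."""
    best = {"support": 0, "constraints": None}

    def traverse(constraints, subset, start):
        # sub-nodes: add one further variable (in fixed order) and one of its values
        for idx in range(start, len(variables)):
            var = variables[idx]
            for value in sorted({r[var] for r in subset}):
                child = [r for r in subset if r[var] == value]
                if len(child) <= best["support"]:
                    continue  # bound: nodes below can only have lower support
                child_constraints = {**constraints, var: value}
                if quality(child) >= q_min:
                    best["support"] = len(child)
                    best["constraints"] = child_constraints
                else:
                    traverse(child_constraints, child, idx + 1)

    traverse({}, records, 0)
    return best


def share_ok(rs):
    """Toy quality index: share of records flagged as within specification."""
    return sum(r["ok"] for r in rs) / len(rs)


records = (
    [{"machine": "A", "shift": "day", "ok": True}] * 3
    + [{"machine": "A", "shift": "night", "ok": False}]
    + [{"machine": "B", "shift": "day", "ok": True},
       {"machine": "B", "shift": "day", "ok": False},
       {"machine": "B", "shift": "night", "ok": True},
       {"machine": "B", "shift": "night", "ok": False}]
)
best = branch_and_bound(records, ["machine", "shift"], share_ok, q_min=0.9)
```

Once a feasible node with support 3 is found, every remaining node with support of at most 3 is skipped without evaluating its quality, which is the cutting rule described above.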



In many real world applications, the influence domain is mixed, consisting of
discrete data and numerical variables. To enable a joint evaluation of both influence
types, the numerical data is transformed into nominal data by mapping the continuous
data onto pre-set quantiles. In most of our applications, we chose the 10%, 20%, 80%
and 90% quantiles, as they performed best.
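The mapping of a numerical influence variable onto quantile-based classes can be sketched with the Python standard library. The function name, the integer class labels and the example data are assumptions; the sketch is restricted to decile cut points, which covers the 10%, 20%, 80% and 90% quantiles mentioned above.

```python
import bisect
import statistics


def quantile_bins(values, deciles=(1, 2, 8, 9)):
    """Map each value to a nominal class 0..len(deciles) using the chosen
    deciles (1 = 10% quantile, ..., 9 = 90% quantile) as cut points."""
    all_deciles = statistics.quantiles(values, n=10, method="inclusive")
    cuts = [all_deciles[k - 1] for k in deciles]
    labels = [bisect.bisect_right(cuts, v) for v in values]
    return labels, cuts


values = list(range(101))              # hypothetical numerical influence variable
labels, cuts = quantile_bins(values)
```

The resulting five classes (below 10%, 10% to 20%, 20% to 80%, 80% to 90%, above 90%) can then be treated like any other nominal influence variable.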

Verification 

The optimum of the problem (3) can only be defined in statistical terms, as in practice 
the sample sets are small and the quality measures are only point estimators. There- 
fore, confidence intervals have to be used in order to get a more valid statement of 
the real value of the considered PCI. In the special case, where the underlying data 
follows a normal distribution, it is straight forward to construct a confidence inter- 
val. As the distribution of pr (Cp denotes the estimator of Cp) is known, a (1 — a)% 
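confidence interval for $C_p$ is given by

$$C(X) = \left[\hat{C}_p \sqrt{\frac{\chi^2_{n-1;\,\alpha/2}}{n-1}},\ \hat{C}_p \sqrt{\frac{\chi^2_{n-1;\,1-\alpha/2}}{n-1}}\right] \tag{6}$$

where $n$ is the sample size and $\chi^2_{n-1;\,q}$ the $q$-quantile of the chi-square
distribution with $n-1$ degrees of freedom.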



For the other parametric basic indices there exists, in general, no analytical
solution, as they all have a non-central distribution. Various numerical
approximations can be found in the literature for $C_{pm}$, $C_{pk}$ and $C_{pmk}$
(see Balamurali and Kalyanasundaram (2002) and Bissel (1989)).

If no assumption about the distribution of the data can be made, computer-based
statistical methods such as the Bootstrap method are used to calculate confidence
intervals. In Balamurali and Kalyanasundaram (2002), the authors present three
different methods for calculating confidence intervals and a simulation study. As a
result, the method called BCa-Method outperformed the other two methods, and is
therefore used in our applications for assigning confidence intervals for the
non-parametric basic PCIs, as described in (3). For the Empirical Capability Index
$E_{ci}$ a simulation study showed that the Bootstrap-Standard-Method, as defined in
Balamurali and Kalyanasundaram (2002), performed best. A $100(1-\alpha)\%$ confidence
interval for $E_{ci}$ can be obtained by

$$C(X) = \left[\hat{E}_{ci} - \Phi^{-1}(1-\alpha/2)\,\sigma_B,\ \hat{E}_{ci} + \Phi^{-1}(1-\alpha/2)\,\sigma_B\right] \tag{7}$$

where $\hat{E}_{ci}$ denotes an estimator for $E_{ci}$, $\sigma_B$ the Bootstrap
standard deviation and $\Phi^{-1}$ the inverse standard normal.
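
The Bootstrap standard interval can be sketched as follows. This is an illustration
under assumptions: the estimator is passed in as a function, `within_spec_share` is a
hypothetical stand-in (the definition of the Empirical Capability Index is not
repeated in this section), and the number of replicates and the seed are arbitrary
choices.

```python
import random
from statistics import NormalDist, stdev


def within_spec_share(sample, lsl, usl):
    """Hypothetical stand-in estimator: share of measurements inside the limits."""
    return sum(lsl <= x <= usl for x in sample) / len(sample)


def bootstrap_standard_ci(sample, estimator, alpha=0.05, n_boot=1000, seed=0):
    """Point estimate plus/minus a normal quantile times the bootstrap std. deviation."""
    rng = random.Random(seed)
    estimate = estimator(sample)
    # Resample with replacement and re-evaluate the estimator on every replicate.
    replicates = [estimator(rng.choices(sample, k=len(sample))) for _ in range(n_boot)]
    sigma_b = stdev(replicates)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * sigma_b, estimate + z * sigma_b
```

The interval is symmetric around the point estimate by construction, which is what
distinguishes the standard method from percentile-based variants such as BCa.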

As the results of the introduced algorithm are based on sample sets, it is important
to verify the soundness of the solutions found. Therefore, the sample set to be
analyzed is randomly divided into two disjoint sets: a training set and a test set. A
set of possible optimal sub-processes is generated by applying the described algorithm
and the referenced Bootstrap methods for calculating confidence intervals to the
training set. In a second step, the root cause analysis algorithm is applied to the
test set. The final output is a verified sub-process.



3 Computational results 

A proof of concept was performed using data from a foundry plant and engine
manufacturing in the premium automotive industry. The 32 analyzed sample sets
comprised measurement results describing geometric characteristics, such as the
position of drill holes or the surface texture of the produced products, and the
corresponding influence sets. The data sets consist of 4 to 14 different values,
specifying for example a particular machine number or a worker's name. An additional
data set, recording the results of a cylinder twist measurement with 76 influence
variables, was used to evaluate the algorithm for numerical parameter sets. Each of
the analyzed data sets has at least 500 and at most 1000 measurement results.

The evaluation was performed for the non-parametric $C_p$ and the empirical
capability index $E_{ci}$, using the described Branch and Bound principle.
Additionally, a combinatorial search for the optimal solution was carried out to
demonstrate the efficiency of our approach. Using the Branch and Bound principle
reduced the computational time by two orders of magnitude compared to the
combinatorial search, as can be seen in Fig. 2. On average, the Branch and Bound
method outperformed the combinatorial search by a factor of 230. The latter took on
average 23 minutes to evaluate the available data sets, whereas Branch and Bound
reduced the average computing time to only 5.7 seconds for the non-parametric $C_p$
and to 7.2 seconds for $E_{ci}$. The search for an optimal solution was performed to a
depth of 4, which means that all sub-processes have no more than 4 different influence
variables. A higher depth level did not yield any other results, as the support of the
sub-processes diminishes with an increasing number of influence variables. Obviously,
the computational time for finding the optimal sub-process increases with the number
of influence variables and their values. This explains the significant jump of the
combinatorial computing time, as the first 12 sample sets are made up of only 4
influence variables, whereas the others consist of up to 17 different influence
variables.

Fig. 2. Computational time for combinatorial search vs. Branch and Bound

As the number of influence parameters of the numerical data set was significantly
larger than that of the other data sets, it took about 2 minutes to find the optimal
solution. The combinatorial search was not performed, as 76 influence variables, each
with 4 values, would have taken too long.
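
To give a rough idea of the size of this search space, the number of candidate
sub-processes can be counted directly. The sketch below assumes that a sub-process
fixes between 1 and 4 distinct influence variables (the depth limit mentioned above)
and that every variable has exactly 4 values; the function name is hypothetical.

```python
from math import comb


def count_subprocesses(n_variables, n_values, max_depth):
    """Number of ways to fix 1..max_depth distinct variables to one value each."""
    return sum(comb(n_variables, k) * n_values ** k for k in range(1, max_depth + 1))
```

Under these assumptions, 76 variables with 4 values each yield on the order of
$10^8$ candidate sub-processes, each of which would require a full evaluation of the
quality measure in an exhaustive search.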



4 Conclusion 

In this paper we have presented a root cause analysis algorithm for process
optimization, with the goal of identifying those process parameters that have a severe
impact on the quality of a manufacturing process. The basic idea was to transform the
search for those quality drivers into an optimization problem and to identify optimal
parameter subsets using Branch and Bound techniques. This method significantly reduces
the computational time needed to identify optimal solutions, as the computational
results show. In addition, a new class of convex process indices was introduced and a
particular specimen, the process capability index $E_{ci}$, was defined. Since the
search for quality drivers in quality management is crucial to industrial practice,
the presented algorithm and the new class of indices may be useful for a broad scope
of quality and reliability problems.



Abstract. Text mining refers generally to the process of deriving high quality
information from unstructured texts. Unstructured texts come in many shapes and sizes.
They may be stored in research papers, articles in technical periodicals, reports,
documents, web pages etc. Here we introduce a new approach for finding textual
patterns representing new technological ideas and inventions in unstructured
technological texts.

This text mining approach follows the statements of technique philosophy. Therefore a tech- 
nological idea or invention represents not only a new mean, but a new purpose and mean 
combination. By systematic identification of the purposes, means and purpose-mean combi- 
nations in unstructured technological texts compared to specialized reference collections, a 
(semi-) automatic finding of ideas and inventions can be realized. Characteristics that are used 
to measure the quality of these patterns found in technological texts are comprehensibility and 
novelty to humans and usefulness for an application. 



1 Introduction 

The planning of technological and scientific research and development (R&D) programs
is a very demanding task, e.g. in the R&D-program of the German ministry of defense
there are more than 1000 different R&D-projects running simultaneously. They all refer
to about 100 different technologies in the context of security and defense. There is
always a lot of change in these programs: many projects are newly started and many
projects run out. One task of our research group is finding new R&D-areas for this
program. New ideas or new inventions are a basis for a new R&D-area. That means that
for planning new R&D-areas it is necessary to identify a lot of new technological
ideas and inventions from the scientific community (Ripke et al. (1972)). Up to now,
the identification of new ideas and inventions in unstructured texts is done manually
(that means by humans) without the support of text mining. Therefore in this paper we
describe the theoretical background of a text mining approach to discover
(semi-)automatically textual patterns representing new ideas and inventions in
unstructured technological texts.

Hotho (2004) describes the characteristics that are used to measure the quality of
these textual patterns extracted by knowledge discovery tasks. The characteristics are
comprehensibility and novelty to the users and usefulness for a task. In this paper the
users are program planners or researchers and the task is to find ideas and inventions
which can be used as a basis for new R&D-areas.

It is known from cognition research that the analysis and evaluation of textual
information requires knowledge of a context (Strube (2003)). The selection of the
context depends on the users and the tasks. Referring to our users and our task, we
have on the one hand textual information about technological R&D-projects existing
worldwide (hereafter called "raw information"). This information contains a lot of
new technological ideas and inventions. New means that the ideas and inventions are
unknown to the user (Ipsen (2002)). On the other hand we have descriptions of our own
R&D-projects. These represent our knowledge base and are hereafter called "context
information". Ideas and inventions in the context information are already known to
the user.

To create a text mining approach for finding ideas and inventions inside the raw
information we first have to create a common structure for raw and context
information. This is necessary for the comparison between raw and context information,
e.g. to distinguish new (that means unknown) ideas and inventions from known ideas and
inventions.

In short, two steps are required: 1. Create a common structure for raw and context
information as a basis for the text mining approach. 2. Create a text mining approach
for finding new, comprehensible and useful ideas and inventions inside the raw
information. Below we describe steps 1 and 2 in detail.



2 A common structure for raw and context information 

In order to perform knowledge discovery tasks (e.g. finding ideas and inventions),
raw information and context information have to be structured and formatted in a
common way, as described above. In general the structure should be rich enough to
allow for interesting knowledge discovery operations and simple enough to allow an
automatic conversion of all kinds of textual information at a reasonable cost, as
described by Feldman et al. (1995).

Raw information is stored in research papers, articles in technical periodicals,
reports, documents, databases, web pages etc. That means raw information comes in a
lot of different structures and formats. Normally context information also comes in
different structures and formats. Converting all structures and formats to a common
structure and format for raw and context information while keeping all structure
information available requires a great deal of work. Therefore our approach is to
convert all information into plain text format. That means we first destroy all
existing structures and then build up a new common structure for raw and context
information.

The new structure should refer to the relationship between terms or
term-combinations (Kamphusmann (2002)). In this paper we realize this by creating
sets of domain specific terms which occur in the context of a term or a combination
of terms. For the structure formulation we define the term unit as a word.

First we create a set of domain specific terms.

Definition 1. Let (a text) $T = [\omega_1, .., \omega_n]$ be a list of terms (words)
$\omega_i$ in order of appearance, let $n \in \mathbb{N}$ be the number of terms in
$T$ and $i \in [1, .., n]$. Let $\Sigma = \{\tilde{\omega}_1, .., \tilde{\omega}_m\}$
be a set of domain specific stop terms (Lustig (1986)) and let $m \in \mathbb{N}$ be
the number of terms in $\Sigma$. $\Omega$ - the set of domain specific terms in text
$T$ - is defined as the relative complement $T$ without $\Sigma$. Therefore:

$$\Omega = T \setminus \Sigma \tag{1}$$

For each $\omega_i \in \Omega$ we create a set of domain specific terms which occur in
the context of term $\omega_i$.

Definition 2. Let $l \in \mathbb{N}$ be a context length of term $\omega_i$, that
means the maximum distance between $\omega_i$ and a term $\omega_j$ in text $T$. Let
the distance be the number of terms (words) which occur between $\omega_i$ and
$\omega_j$ including the term $\omega_j$, and let $j \in [1, .., n]$. $\Phi_i$ is
defined as the set of those domain specific terms which occur in an $l$-length context
of term $\omega_i$ in text $T$:

$$\Phi_i = \{\omega_j \mid (\omega_j \in \Omega) \wedge (|i - j| \le l) \wedge (\omega_i \ne \omega_j)\} \tag{2}$$

For each combination of terms in $\Phi_i$ we create a set of domain specific terms
which occur in the context of this combination of terms.



Definition 3. Let $\delta_p \in \Omega$ be a term in a list of terms with number
$p \in [1, .., \mu]$. Let $\delta_1, .., \delta_\mu$ be a list of terms - in the
following called term-combination - with $\delta_p \ne \delta_q$ for all
$p \ne q \in [1, .., \mu]$ that occur together in an $l$-length context of term
$\delta_1$ in text $T$. Let $\mu \in \mathbb{N}$ be the number of terms in the
term-combination $\delta_1, .., \delta_\mu$. $\Xi_{\delta_1,..,\delta_\mu}$ is defined
as the set of domain specific terms which occur together with the term-combination
$\delta_1, .., \delta_\mu$ in an $l$-length context of term $\delta_1$ in text $T$:

$$\Xi_{\delta_1,..,\delta_\mu} = \Phi_i \setminus \bigcup_{p=2}^{\mu} \{\delta_p\} \quad \text{with} \quad \delta_1 = \omega_i \ \wedge\ \bigcup_{p=2}^{\mu} \{\delta_p\} \subset \Phi_i \tag{3}$$

In Figure 1 an example for the relationships in the set $\Xi_{\delta_1,..,\delta_\mu}$
is presented. The term-combination (sensor, infrared, uncooled) has a relationship to
the term-combination (focal, array, plane) because uncooled infrared sensors can be
built by using the focal plane array technology.
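
The three definitions can be illustrated with a small Python sketch. It is not the
author's implementation: the function names are hypothetical, positions are counted
from 0 instead of 1, and the results for several occurrences of the first term of a
term-combination are merged by union, which the definitions leave open.

```python
def domain_terms(text_terms, stop_terms):
    """Omega: all terms of the text that are not stop terms (Definition 1)."""
    return {t for t in text_terms if t not in stop_terms}


def context_set(text_terms, omega, i, l):
    """Phi_i: domain terms at distance <= l from position i, except the term itself."""
    low, high = max(0, i - l), min(len(text_terms), i + l + 1)
    return {text_terms[j] for j in range(low, high)
            if text_terms[j] in omega and text_terms[j] != text_terms[i]}


def combination_context(text_terms, stop_terms, combination, l):
    """Xi: domain terms co-occurring with the whole term-combination (Definition 3)."""
    omega = domain_terms(text_terms, stop_terms)
    first, rest = combination[0], set(combination[1:])
    xi = set()
    for i, term in enumerate(text_terms):
        if term != first or term not in omega:
            continue
        phi = context_set(text_terms, omega, i, l)
        if rest <= phi:          # all other terms of the combination are in the context
            xi |= phi - rest     # keep only the further terms
    return xi
```

Applied to a sentence such as "the uncooled infrared sensor uses a focal plane array"
with the stop terms "the", "uses" and "a", the combination (sensor, infrared,
uncooled) and a context length of 8, the sketch returns the terms focal, plane and
array, which mirrors the example of Figure 1.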

The text $T$ could be a) the textual raw information or b) the textual context
information. As a result we get $\Xi^{raw}_{\delta_1,..,\delta_\mu}$ in case a) and
$\Xi^{context}_{\delta_1,..,\delta_\mu}$ in case b).

Definition 4. To identify terms or term-combinations in the raw information which
also occur in the context information - that means the terms or term-combinations
are known to the user - we define $\Xi^{known}_{\delta_1,..,\delta_\mu}$ as the set of
terms which occur in both $\Xi^{raw}_{\delta_1,..,\delta_\mu}$ and
$\Xi^{context}_{\delta_1,..,\delta_\mu}$:

$$\Xi^{known}_{\delta_1,..,\delta_\mu} = \Xi^{raw}_{\delta_1,..,\delta_\mu} \cap \Xi^{context}_{\delta_1,..,\delta_\mu} \tag{4}$$

Fig. 1. Example for the relationships in $\Xi_{\delta_1,..,\delta_\mu}$: Uncooled
infrared sensors can be built by using the focal plane array technology.



3 Relevant aspects for the text mining approach from technique 
philosophy 

The text mining approach follows the statements of technique philosophy (Rohpohl 
(1996)). Below we describe some relevant aspects of the statements and some spe- 
cific conclusions for our text mining approach. 

a) A technological idea or invention represents not only a new mean, but a new 
purpose and mean combination. That means to find an idea or invention it is 
necessary to identify a mean and an appertaining purpose in the raw information. 
Appertaining means that purpose and mean shall occur together in an l-length
context. Therefore for our text mining approach we firstly want to identify a 
mean and secondly we want to identify an appertaining purpose or vice versa. 

b) Purposes and means can be exchanged. That means a purpose can become a mean 
in a specific context and vice versa. Example: A raw material (mean) is used to 
create an intermediate product (purpose). The intermediate product (mean) is 
then used to produce a product (purpose). In this example the intermediate prod- 
uct changes from purpose to mean because of the different context. Therefore 
for our text mining approach it is possible to identify textual patterns represent- 
ing means or purposes. But it is not possible to distinguish between means and 
purposes without the knowledge of the specific context. 

c) A purpose or a mean is represented by a technical term or by several technical
terms. Therefore purposes or means can be represented by a combination of domain
specific terms (e.g. $\delta_1, .., \delta_\mu$) which occur together in an l-length
context. The purpose-mean combination is a combination of 2 term-combinations and it
also occurs in an l-length context as described in 3 a). For the formulation a
term-combination $\delta_1, .., \delta_\mu$ represents a mean (a purpose) only if
$\Xi^{raw}_{\delta_1,..,\delta_\mu} \ne \emptyset$, which means there are further
domain-specific terms representing a purpose (a mean) which occur in an l-length
context together with the term-combination $\delta_1, .., \delta_\mu$ in the raw
information.

d) To find an idea or invention that is really new to the user, the purpose-mean
combination must be unknown to the user. That means a mean and an appertaining
purpose in the raw information must not occur as mean and as appertaining purpose in
the context information. For the formulation the term-combination
$\delta_1, .., \delta_\mu$ from 3 c) represents a mean (a purpose) in a new idea or
invention only if $\Xi^{known}_{\delta_1,..,\delta_\mu} = \emptyset$, which means there
are no further domain-specific terms which occur in an l-length context together with
the term-combination $\delta_1, .., \delta_\mu$ in the raw and in the context
information.

e) To find an idea or invention that is comprehensible to the user, either the purpose
or the mean must be known to the user. That means one part (a purpose or a mean)
of the new idea or invention is known to the user and the other part is unknown.
The user understands the known part because it is also a part of a known idea or
invention that occurs in the context information, and therefore he gets access
to the new idea or invention in the raw information.

That means the terms representing either the purpose or the mean in the raw
information must occur as purpose or mean in the context information. For the
formulation the term-combination $\delta_1, .., \delta_\mu$ from 3 d) represents a mean
(a purpose) in a comprehensible idea or invention only if
$\Xi^{context}_{\delta_1,..,\delta_\mu} \ne \emptyset$, which means
$\delta_1, .., \delta_\mu$ is known to the user and there are further domain-specific
terms representing a purpose (a mean) which occur in an l-length context together with
the term-combination $\delta_1, .., \delta_\mu$ in the context information.

f) Normally an idea or an invention is useful for a specific task. Transferring an idea 
or an invention to a different task makes it sometimes necessary that the idea or 
invention has to be changed to become useful for the new task. To change an idea 
or invention you have to change either the purpose or the mean. That is because 
the known term-combination $\delta_1, .., \delta_\mu$ from 3 e) must not be changed, otherwise
it will become unknown to the user and then the idea or invention is not compre- 
hensible to the user as described in 3 e). 

g) After some evaluation we gained the experience that for finding ideas and
inventions the number of known terms (e.g. representing a mean) and the number of
unknown terms (e.g. representing the appertaining purposes) shall be well balanced.
Example: one unknown term among many known terms often indicates that an old idea got
a new name. Therefore the unknown term is probably not a mean or a purpose. That means
the probability that $\delta_1, .., \delta_\mu$ is a mean or a purpose increases when
$\mu$ is close to the cardinality of $\Xi^{raw}_{\delta_1,..,\delta_\mu}$.

h) There are often domain specific stop terms (like better, higher, quicker,
integrated, minimized etc.) which occur with ideas and inventions. They point to
a changing purpose or a changing mean and can be indicators for ideas and inventions.

i) An identified new idea or invention can be a basis for further new ideas and
inventions. That means all ideas and inventions that are similar to the identified
new idea and invention are also possible new ideas and inventions.



4 A text mining approach for finding new ideas and inventions 

In this paper we want to create a text mining approach by applying points 3 a) to
3 g). Further we want to prove the feasibility of our text mining approach.

Below, we first identify a mean and then identify an appertaining purpose, as
described in 3 a). The other case - first identify a purpose and then identify an
appertaining mean - is trivial because of the purpose-mean dualism described in 3 b).

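Definition 5. We define $p(\Xi^{raw}_{\delta_1,..,\delta_\mu})$ as the probability that
the term-combination $\delta_1, .., \delta_\mu$ in the raw information is a mean. It
expresses whether $\mu$ is close to the cardinality of
$\Xi^{raw}_{\delta_1,..,\delta_\mu}$ or not, as described in 3 g):

$$p(\Xi^{raw}_{\delta_1,..,\delta_\mu}) = \begin{cases} \mu \,/\, |\Xi^{raw}_{\delta_1,..,\delta_\mu}| & \text{if } \mu \le |\Xi^{raw}_{\delta_1,..,\delta_\mu}| \\ |\Xi^{raw}_{\delta_1,..,\delta_\mu}| \,/\, \mu & \text{if } \mu > |\Xi^{raw}_{\delta_1,..,\delta_\mu}| \end{cases} \tag{5}$$

The user determines a minimum probability $p_{min}$. For the text mining approach
the term-combinations are means only if

a) $\Xi^{raw}_{\delta_1,..,\delta_\mu} \ne \emptyset$ as described in 3 c),

b) $\Xi^{known}_{\delta_1,..,\delta_\mu} = \emptyset$ as described in 3 d) to get a new
idea or invention,

c) $\Xi^{context}_{\delta_1,..,\delta_\mu} \ne \emptyset$ as described in 3 e) to get a
comprehensible idea or invention and

d) $p(\Xi^{raw}_{\delta_1,..,\delta_\mu}) \ge p_{min}$ as described in 3 g).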

For each of these term-combinations we collect all appertaining purposes (that
means the combinations of all further terms) which occur in an l-length context
together with $\delta_1, .., \delta_\mu$ in the raw information.
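
The selection criteria can be sketched as a simple filter. The sketch rests on
assumptions: the context sets of a term-combination are given as Python sets, the
probability of Definition 5 is taken to be the ratio of the smaller to the larger of
the combination size and the cardinality of the raw context set, and reaching the
minimum probability exactly counts as sufficient. The function names are hypothetical.

```python
def mean_probability(mu, xi_raw_size):
    """Balance between combination size and raw context size (assumed form)."""
    if mu == 0 or xi_raw_size == 0:
        return 0.0
    return min(mu, xi_raw_size) / max(mu, xi_raw_size)


def is_candidate_mean(combination, xi_raw, xi_context, p_min):
    """Check criteria a) to d) for one term-combination."""
    xi_known = xi_raw & xi_context  # terms known from the context information
    return (
        bool(xi_raw)              # a) further terms exist in the raw information
        and not xi_known          # b) none of them is known: the idea is new
        and bool(xi_context)      # c) the combination itself is known: comprehensible
        and mean_probability(len(combination), len(xi_raw)) >= p_min  # d) balance
    )
```

For every combination that passes the filter, the terms in its raw context set are
the candidates from which the appertaining purposes are formed.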

We present each $\delta_1, .., \delta_\mu$ as a known mean together with all
appertaining unknown purposes to the user. The user selects the purposes suited to his
task or combines some purposes into a new purpose. That means he changes the purpose so
that it becomes useful for his task, as described in 3 f). Additionally, the user may
change known means to known purposes and appertaining purposes to appertaining means
as described in 3 b), because at this point the user gains knowledge of the specific
context.

With this selection the user gets the purpose-mean combination, that means he
gets an idea or invention. This idea or invention is novel to him because of 3 d) and
it is comprehensible to him because of 3 e). Further, it is useful for his application
because the user selects the purposes suited to his task.



5 Evaluation and outlook 

We have done a first evaluation with a text about R&D-projects from the USA as raw
information (Fenner et al. (2006)), a text about our own R&D-projects as context
information (Thorleuchter (2007)), a stop word list created for the raw information and
the parameter values $l = 8$ and $p_{min} = 50\%$. The aim is to find new,
comprehensible and useful ideas and inventions in the raw information. According to
human experts the number of these relevant elements - the so-called "ground truth" for
the evaluation - is eighteen. That means eighteen ideas or inventions can be used as a
basis for new R&D-areas. With the text mining approach we extracted about fifty
patterns (retrieved elements) from the raw information. The patterns have been
evaluated by the experts. Thirteen patterns are new, comprehensible and useful ideas or
inventions, that means thirteen of the fifty patterns are relevant elements. Five new,
comprehensible and useful ideas or inventions are not found by the text mining
approach. Therefore, as a result we get a precision value of about 26% and a recall
value of about 72%. This is not representative because of the small number of relevant
elements, but we think this is above chance and it is sufficient to prove the
feasibility of the approach.
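
The two values follow directly from the counts given above; the short sketch below
only restates this arithmetic (the function name is hypothetical).

```python
def precision_recall(relevant_retrieved, retrieved, relevant):
    """Precision = hits / retrieved patterns, recall = hits / relevant elements."""
    return relevant_retrieved / retrieved, relevant_retrieved / relevant


# 13 relevant patterns among about 50 retrieved ones, 18 relevant elements in total.
precision, recall = precision_recall(13, 50, 18)
```

With thirteen hits, precision is 13/50 and recall is 13/18, which matches the
reported values of about 26% and 72%.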

For future work we will first enlarge the stop word list to a general stop word
list for technological texts and optimize the parameters with respect to the precision
and recall values. Secondly, we will extend the text mining approach with further
considerations, e.g. the two described in 3 h) and 3 i). The aim of this work is to
obtain better precision and recall values. Thirdly, we will implement the text mining
approach as a web-based application. That will help the users to find new,
comprehensible and useful ideas and inventions with this text mining approach.
Additionally, with this application it will be easier for us to do a representative
evaluation.



6 Acknowledgements

This work was supported by the German Ministry of Defense. We thank Joachim
Schulze for his constructive technical comments and Jörg Fenner for helping to collect
the raw and context information and to evaluate the text mining approach.







Dirk Thorleuchter 



References 

FELDMAN, R. and DAGAN, I. (1995): KDT - knowledge discovery in texts. In: Proceedings of
the First International Conference on Knowledge Discovery (KDD). Montreal, 112-113.

FENNER, J. and THORLEUCHTER, D. (2006): Strukturen und Themengebiete der
mittelstandsorientierten Forschungsprogramme in den USA. Fraunhofer INT's edition,
Euskirchen, 2.

HOTHO, A. (2004): Clustern mit Hintergrundwissen. Univ. Diss., Karlsruhe, 29.

IPSEN, C. (2002): F&E-Programmplanung bei variabler Entwicklungsdauer. Verlag Dr. Kovac,
Hamburg, 10.

KAMPHUSMANN, T. (2002): Text-Mining. Symposion Publishing, Düsseldorf, 28.

LUSTIG, G. (1986): Automatische Indexierung zwischen Forschung und Anwendung. Georg
Olms Verlag, Hildesheim, 92.

RIPKE, M. and STOBER, G. (1972): Probleme und Methoden der Identifizierung potentieller
Objekte der Forschungsförderung. In: H. Paschen and H. Krauch (Eds.): Methoden und
Probleme der Forschungs- und Entwicklungsplanung. Oldenbourg, München, 47.

ROHPOHL, G. (1996): Das Ende der Natur. In: L. Schäfer and E. Ströker (Eds.):
Naturauffassungen in Philosophie, Wissenschaft und Technik. Bd. 4, Freiburg, München, 151.

STRUBE, G. (2003): Menschliche Informationsverarbeitung. In: G. Görz, C.-R. Rollinger and
J. Schneeberger (Eds.): Handbuch der Künstlichen Intelligenz. 4. Auflage, Oldenbourg,
München, 23-28.

THORLEUCHTER, D. (2007): Überblick über F&T-Vorhaben und ihre Ansprechpartner im
Bereich BMVg. Fraunhofer Publica, Euskirchen, 2-88.




Investigating Classifier Learning Behavior with 
Experiment Databases 



Joaquin Vanschoren and Hendrik Blockeel 

Computer Science Dept., K.U.Leuven, 

Celestijnenlaan 200A, 3001 Leuven, Belgium 



Abstract. Experimental assessment of the performance of classification algorithms is an im- 
portant aspect of their development and application on real-world problems. To facilitate this 
analysis, large numbers of such experiments can be stored in an organized manner and in com- 
plete detail in an experiment database. Such databases serve as a detailed log of previously 
performed experiments and a repository of verifiable learning experiments that can be reused 
by different researchers. We present an existing database containing 250,000 runs of classi- 
fier learning systems, and show how it can be queried and mined to answer a wide range of 
questions on learning behavior. We believe such databases may become a valuable resource 
for classification researchers and practitioners alike. 



1 Introduction 

Supervised classification is the task of learning from a set of classified training
examples $(x, c(x))$, where $x \in X$ (the instance space) and $c(x) \in C$ (a finite
set of classes), a classifier function $f : X \to C$ such that $f$ approximates $c$
(the target function) over $X$. Most of the existing algorithms for learning $f$ are
heuristic in nature, and try to (quickly) approach $c$ by making some assumptions that
may or may not hold for the given data. They assume $c$ to be part of some designated
set of functions (the hypothesis space), deem some functions more likely than others,
and strictly consider consistency with the observed training examples (not with $X$ as
a whole). While there is theory relating such heuristics to finding $c$, in many cases
this relationship is not so clear, and the utility of a certain algorithm needs to be
evaluated empirically.

As in other empirical sciences, experiments should be performed and described
in such a way that they are easily verifiable by other researchers. However, given
that the exact algorithm implementation used, its chosen parameter settings, the
datasets used and the experimental methodology all influence the outcome of an
experiment, it is in practice far from trivial to describe such experiments
completely. Furthermore, there exist complex interactions between data properties,
parameter settings and the performance of learning algorithms. Hence, to thoroughly
study these interactions and to assess the generality of observed trends, we need a
sufficiently large sample of experiments, covering many different conditions, organized
in a way that makes their results easily accessible and interpretable.

For these reasons, Blockeel (2006) proposed the use of experiment databases: 
databases describing a large number of learning experiments in complete detail, serv- 
ing as a detailed log of previously performed experiments and an (online available) 
repository of learning experiments that can be reused by different researchers. Bloc- 
keel and Vanschoren (2007) provide a detailed account of the advantages and disad- 
vantages of experiment databases, and give guidelines for designing them. As a proof 
of the concept, they present a concrete implementation that contains a full descrip- 
tion of the experimental conditions and results of 250,000 runs of classifier learning 
systems, together with a few examples of its use and results that were obtained from 
it. 

In this paper we provide a more detailed discussion of how this database can be
used in practice to store the results of many learning experiments and to obtain a clear
picture of the performance of the involved algorithms and the effects of parameter
settings and dataset characteristics. We believe that this discussion may be of interest
to anyone who may want to use this database for their own purposes, or set up
similar databases for their own research.

We describe the structure of the database in Sect. 2 and the experiments in Sect. 
3. In Sect. 4 we illustrate the power of this database by showing how SQL queries 
and data mining techniques can be used to investigate classifier learning behavior. 
Section 5 concludes. 



2 A database for classification experiments 

To efficiently store and allow queries about all aspects of previously performed clas- 
sification experiments, the relationships between the involved learning algorithms, 
datasets, experimental procedures and results are captured in the database structure, 
shown in Fig. 1. Since many of these aspects are parameterized, we use instantiations 
to uniquely describe them. As such, an Experiment (central in the figure) consists 
of instantiations of the used learner, dataset and evaluation method. 

First, a Learner instantiation points to a learning algorithm (Learner),
which is described by the algorithm name, version number, a URL where it can be
downloaded, and some generally known or calculated properties (Van Someren
(2001), Kalousis & Hilario (2000)), like the approach used (e.g. neural networks)
or how susceptible it is to noise. Then, if an algorithm is parameterized, the
parameter settings used in each learner instantiation (one of which is set as default)
are stored in table Learner_parval. Because algorithms have different numbers and
kinds of parameters, we store each parameter value assignment in a different row (in
Fig. 1 only two are shown). A Learner_parameter is described by the learner it
belongs to, its name and a specification of sensible or suggested values, to facilitate
experimentation.

Secondly, the Dataset used, which can be instantiated with a randomization of
the order of its attributes or examples (e.g. for incremental learners), is recorded by
its name, download URL(s), the index of the class attribute and some information on
its origin (e.g. to which repository it belongs or how it was generated artificially).
In order to investigate whether the performance of an algorithm is linked to certain
kinds of data, a large set of dataset characterization metrics is stored, most of which
are described in Peng et al. (2002). These can be useful to help gain insight into an
algorithm's behavior and, conversely, assess a learner's suitability for handling new
learning problems.*

[Figure 1: schema diagram. Recoverable tables, columns and example rows:]

Learner_parval / Learner_parameter: pid, lid, name, alias, learner_inst, kernel_inst,
    default, min, max, (sugg)
    15, 64, C, conf. threshold, false, false, 0.25, 0.01, 0.99
    15, 65, M, min nr inst/leaf, false, false, 2, 2, 20
Learner: lid, name, version, url, class, (charact)
    15, J48, 1.2, http://..., tree
Machine: mach_id, corr_fact, (props)
    ng-06-04, 1
Experiment: eid, learner_inst, data_inst, eval_meth, type, status, priority, machine,
    error, (backgr_info)
    13, 15, 1, 1, classificat..., done, 9, ng-06-04
Dataset: did, name, origin, url, class_index, ..., def_acc, (charact)
    230, anneal, uci, http://..., -1, 898, 0.7617
Eval_meth_parval: (columns not legible)
Evaluation: eid, cputime, memory, pred_acc, mn_abs_err, conf_mat, (metrics)
    13, 0:0:0:0.55, 226kb, 0.9844, 0.0056
(prediction table): eid, inst, class, prob, predicted
    13, 1, 3, 0.7235, true

Fig. 1. A simplified schema of the experiment database.

Finally, we must store an evaluation of the experiments. The evaluation method 
(e.g. cross-validation) is stored together with its parameters (e.g. the number of 
folds). If a dataset is divided into a training set and a test set, this is defined in table 
Testset_of. The result of the evaluation of each experiment is described in table 
Evaluation by a wide range of evaluation metrics for classification, including the 
contingency tables. To compare CPU times, a factor describing the relative speed of
the used Machine is stored as part of the machine description. The last table in Fig. 1 
stores the (probabilities of the) predictions returned by each experiment, which may 
be used to calculate new performance measures without rerunning the experiments. 
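
As an illustration of how stored predictions allow new measures to be computed afterwards, the following sketch derives accuracy and log loss from per-class probabilities. The row layout (inst, class, prob) loosely follows the last table of Fig. 1; the `true_class` lookup, the function names and all numbers are hypothetical.

```python
import math

# (inst, class, prob): probability the learner assigned to each class
predictions = [
    (1, "a", 0.7), (1, "b", 0.3),
    (2, "a", 0.4), (2, "b", 0.6),
    (3, "a", 0.9), (3, "b", 0.1),
]
true_class = {1: "a", 2: "a", 3: "a"}  # hypothetical ground truth


def log_loss(predictions, true_class):
    """Mean negative log-probability assigned to the true class."""
    probs = {(inst, cls): p for inst, cls, p in predictions}
    losses = [-math.log(probs[(inst, cls)]) for inst, cls in true_class.items()]
    return sum(losses) / len(losses)


def accuracy(predictions, true_class):
    """Share of instances whose most probable class is the true class."""
    best = {}
    for inst, cls, p in predictions:
        if inst not in best or p > best[inst][1]:
            best[inst] = (cls, p)
    hits = sum(best[inst][0] == cls for inst, cls in true_class.items())
    return hits / len(true_class)


print(accuracy(predictions, true_class), log_loss(predictions, true_class))
```

Neither measure requires rerunning the learner; both are computed from the stored rows alone.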



3 The experiments 

To populate the database with experiments, we selected 54 classification algorithms
from the WEKA platform (Witten and Frank (2005)) and inserted them together with
all their parameters. Also, 86 commonly used classification datasets were taken from
the UCI repository and inserted together with their calculated characteristics. Then,
to generate a sample of classification experiments that covers a wide range of
conditions, while also allowing us to test the performance of some algorithms under very
specific conditions, some algorithms were explored more thoroughly than others.

* New data and algorithm characterizations can be added at any time by adding more columns
and calculating the characterizations for all datasets or algorithms.

First, we ran all experiments with their default parameter settings on all datasets.
Secondly, we defined sensible values for the most important parameters of the
algorithms SMO (which trains a support vector machine), MultilayerPerceptron, J48 (a
C4.5 implementation), 1R (a simple rule learner) and Random Forests (an ensemble
learner) and varied each of these parameters one by one, while keeping all other
parameters at default. Finally, we further explored the parameter spaces of J48 and 1R
by selecting random parameter settings until we had about 1000 experiments on each
dataset. For all randomized algorithms, each experiment was repeated 20 times with
different random seeds. All experiments (about 250,000 in total) were evaluated
using 10-fold cross-validation, using the same folds for each dataset.
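
The two exploration strategies (varying one parameter at a time, then sampling random settings) can be sketched as follows. The candidate values in `sensible` and both function names are hypothetical; only the default values of C and M are taken from Fig. 1.

```python
import random

defaults = {"C": 0.25, "M": 2}                        # default setting
sensible = {"C": [0.01, 0.25, 0.5], "M": [2, 5, 20]}  # hypothetical candidates


def one_at_a_time(defaults, sensible):
    """Vary each parameter in turn while keeping all others at default."""
    settings = [dict(defaults)]
    for name, values in sensible.items():
        for v in values:
            if v != defaults[name]:
                settings.append({**defaults, name: v})
    return settings


def random_settings(sensible, n, seed=0):
    """Draw n random settings from the candidates (seeded for repeatability)."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in sensible.items()}
            for _ in range(n)]


for s in one_at_a_time(defaults, sensible):
    print(s)
```

One-at-a-time variation isolates the effect of a single parameter, whereas random sampling also covers combinations of non-default values.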

An online interface is available at http://www.cs.kuleuven.be/~dtai/expdb/ for
those who want to reuse experiments for their own purposes, together with a full
description and code which may be of use to set up similar databases, for example to
store, analyse and publish the results of large benchmark studies.



4 Using the database 

We will now illustrate how easy it is to use this experiment database to investigate 
a wide range of questions on the behavior of learning algorithms by simply writing 
the right queries and interpreting the results, or by applying data mining algorithms 
to model more complex interactions. 

4.1 Comparing different algorithms 

A first question may be “How do all algorithms in this database compare on a spe- 
cific dataset D?” To investigate this, we query for the learning algorithm name and 
evaluation result (e.g. predictive accuracy), linked to all experiments on (an instance 
of) dataset D, which yields the following query: 

SELECT l.name, v.pred_acc
FROM experiment e, learner_inst li, learner l, data_inst di,
     dataset d, evaluation v
WHERE v.eid = e.eid AND e.learner_inst = li.liid AND li.lid = l.lid
  AND e.data_inst = di.diid AND di.did = d.did AND d.name = 'D'

We can now interpret the returned results, e.g. by drawing a scatterplot. For
dataset monks-problems-2 (a near-parity problem), this yields Fig. 2, giving a clear
overview of how each algorithm performs and (for those algorithms whose parameters
were varied) how much variance is caused by different parameter settings.
Only a few algorithms surpass default accuracy (67%) and while some cover a wide
spectrum (like J48), others jump to 100% accuracy for certain parameter settings
(SMO with higher-order polynomial kernels and MultilayerPerceptron when enough
hidden nodes are used).

Fig. 2. Algorithm performance comparison on the monks-problems-2_test dataset.

We can also compare two algorithms A1 and A2 on all datasets by joining their 
performance results (with default settings) on each dataset, and plotting them against 
each other, as shown in Fig. 3. Moreover, querying also allows us to use aggregates 
and to order results, e.g. to directly build rankings of all algorithms by their average 
error over all datasets, using default parameters: 

SELECT l.name, avg(v.mn_abs_err) AS avg_err
FROM experiment e, learner l, learner_inst li, evaluation v
WHERE v.eid = e.eid AND e.learner_inst = li.liid
  AND li.lid = l.lid AND li.default = true
GROUP BY l.name ORDER BY avg_err ASC

Similar questions can be answered in the same vein. With small adjustments, we
can query for the variance, etc. of each algorithm’s error (over all or a single dataset),
study how much error rankings differ from one dataset to another, or study how
parameter optimization affects these rankings.
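
The ranking query can be tried out on a toy database. The sketch below uses SQLite with made-up error values; it is not the authors' database. One deliberate deviation is an assumption of this sketch: the flag column is called `is_default` (stored as 0/1), because `default` is a reserved word in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE learner (lid INTEGER, name TEXT);
CREATE TABLE learner_inst (liid INTEGER, lid INTEGER, is_default INTEGER);
CREATE TABLE experiment (eid INTEGER, learner_inst INTEGER);
CREATE TABLE evaluation (eid INTEGER, mn_abs_err REAL);
INSERT INTO learner VALUES (1, 'J48'), (2, 'OneR');
INSERT INTO learner_inst VALUES (10, 1, 1), (11, 1, 0), (20, 2, 1);
INSERT INTO experiment VALUES (100, 10), (101, 10), (102, 11), (200, 20), (201, 20);
INSERT INTO evaluation VALUES (100, 0.10), (101, 0.20), (102, 0.90),
                              (200, 0.30), (201, 0.50);
""")

# Average error per algorithm over all experiments run with default settings.
ranking = conn.execute("""
    SELECT l.name, avg(v.mn_abs_err) AS avg_err
    FROM experiment e, learner l, learner_inst li, evaluation v
    WHERE v.eid = e.eid AND e.learner_inst = li.liid
      AND li.lid = l.lid AND li.is_default = 1
    GROUP BY l.name ORDER BY avg_err ASC
""").fetchall()

for name, avg_err in ranking:
    print(name, round(avg_err, 3))
```

Experiment 102 uses a non-default instantiation and is therefore excluded from the average of J48.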




SELECT s1.name, avg(s1.pred_acc) AS A1_acc, avg(s2.pred_acc) AS A2_acc
FROM (SELECT d.name, e.pred_acc FROM .. WHERE l.name = 'A1' ... ) AS s1
JOIN (SELECT d.name, e.pred_acc FROM .. WHERE l.name = 'A2' ... ) AS s2
  ON s1.name = s2.name
GROUP BY s1.name



Fig. 3. Comparing relative performance of J48 and OneR with a single query. 
4.2 Querying for parameter effects

Previous queries generalized over all parameter settings. Yet, starting from our first 
query, we can easily study the effect of a specific parameter P by “zooming in” on 
the results of algorithm A (by adding this constraint) and selecting the value of P 
linked to (an instantiation of) A, yielding Fig. 4a: 

SELECT v.pred_acc, lv.value
FROM experiment e, learner_inst li, learner l, data_inst di,
     dataset d, evaluation v, learner_parameter lp, learner_parval lv
WHERE v.eid = e.eid AND e.learner_inst = li.liid AND li.lid = l.lid
  AND l.name = 'A' AND lv.liid = li.liid AND lv.pid = lp.pid
  AND lp.name = 'P' AND e.data_inst = di.diid AND di.did = d.did
  AND d.name = 'D'

Sometimes the effect of a parameter P may be dependent on the value of another 
parameter. Such a parameter P2 can however be controlled (e.g. by demanding its 
value to be larger than V) by extending the previous query with a constraint requiring 
that the learner instances additionally are amongst those where parameter P2 obeys 
those constraints. 

WHERE ... AND lv.liid IN
  (SELECT lv.liid FROM learner_parval lv, learner_parameter lp
   WHERE lv.pid = lp.pid AND lp.name = 'P2' AND lv.value > V)

Launching and visualizing such queries yield results such as in Fig. 4, clearly 
showing the effect of the selected parameter and the variation caused by other pa- 
rameters. As such, it is immediately obvious how general an observed trend is: all 
constraints are explicitly mentioned in the query. 
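
A runnable toy version of the parameter query with an additional constraint on a second parameter is sketched below. The schema is simplified (experiments join directly to parameter values, and values are stored as numbers), and all ids and accuracies are made up; the function name `effect_of_p` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE learner_parameter (pid INTEGER, name TEXT);
CREATE TABLE learner_parval (liid INTEGER, pid INTEGER, value REAL);
CREATE TABLE experiment (eid INTEGER, learner_inst INTEGER);
CREATE TABLE evaluation (eid INTEGER, pred_acc REAL);
INSERT INTO learner_parameter VALUES (1, 'P'), (2, 'P2');
INSERT INTO learner_parval VALUES (10, 1, 2), (10, 2, 0.1),
                                  (11, 1, 5), (11, 2, 0.5),
                                  (12, 1, 9), (12, 2, 0.9);
INSERT INTO experiment VALUES (100, 10), (101, 11), (102, 12);
INSERT INTO evaluation VALUES (100, 0.70), (101, 0.80), (102, 0.60);
""")


def effect_of_p(min_p2):
    """(value of P, accuracy) pairs, restricted to instances with P2 > min_p2."""
    return conn.execute("""
        SELECT lv.value, v.pred_acc
        FROM experiment e, evaluation v, learner_parval lv, learner_parameter lp
        WHERE v.eid = e.eid AND lv.liid = e.learner_inst
          AND lv.pid = lp.pid AND lp.name = 'P'
          AND lv.liid IN
            (SELECT lv2.liid FROM learner_parval lv2, learner_parameter lp2
             WHERE lv2.pid = lp2.pid AND lp2.name = 'P2' AND lv2.value > ?)
        ORDER BY lv.value
    """, (min_p2,)).fetchall()


print(effect_of_p(0.2))
```

Raising or lowering the threshold passed to `effect_of_p` changes which learner instantiations contribute points, which is exactly the kind of constraint that remains visible in the query text.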





Fig. 4. The effect of the minimal leafsize of J48 on monks-problems-2_test (a), after re- 
quiring binary trees (b), and after also suppressing reduced error pruning (c) 
4.3 Querying for the effect of dataset properties

It also becomes easy to investigate the interactions between data properties and learn- 
ing algorithms. For instance, we can use our experiments to study the effect of a 
dataset’s size on the performance of algorithm A†:

SELECT v.pred_acc, d.nr_examples
FROM experiment e, learner_inst li, learner l, data_inst di,
     dataset d, evaluation v
WHERE v.eid = e.eid AND e.learner_inst = li.liid AND li.lid = l.lid
  AND l.name = 'A' AND e.data_inst = di.diid AND di.did = d.did

4.4 Applying data mining techniques to the experiment database 

There can be very complex interactions between parameter settings, dataset charac- 
teristics and the resulting performance of learning algorithms. However, since a large 
number of experimental results are available for each algorithm, we can apply data 
mining algorithms to model those interactions. 

For instance, to automatically learn which of J48’s parameters have the greatest 
impact on its performance on monks-problems-2_test (see Fig. 4), we queried 
for the available parameter settings and corresponding results. We discretized the 
performance with thresholds on 67% (default accuracy) and 85%, and we used J48 
to generate a (meta-)decision tree that, given the used parameter settings, predicts in 
which interval the accuracy lies. The resulting tree (with 97.3% accuracy) is shown 
in Fig. 5. It clearly shows which are the most important parameters to tune, and how 
they affect J48’s performance. 
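
The discretization step that turns query results into a meta-dataset can be sketched as follows. The thresholds (67% and 85%) come from the text; the interval labels, the treatment of values lying exactly on a threshold, and the example rows are assumptions.

```python
def accuracy_interval(acc, low=0.67, high=0.85):
    """Map a predictive accuracy to one of three intervals."""
    if acc <= low:
        return "at_or_below_default"
    if acc <= high:
        return "medium"
    return "high"


# Hypothetical query results: (parameter settings, predictive accuracy).
results = [
    ({"binary_splits": True, "min_inst": 2}, 0.99),
    ({"binary_splits": False, "min_inst": 2}, 0.72),
    ({"binary_splits": False, "min_inst": 20}, 0.65),
]

# Each meta-example pairs the settings with the discretized accuracy,
# which is the target a meta-level decision tree learner would predict.
meta_dataset = [(params, accuracy_interval(acc)) for params, acc in results]

for params, label in meta_dataset:
    print(params, label)
```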

Likewise, we can study for which dataset characteristics one algorithm greatly
outperforms another. Starting from the query in Fig. 3, we additionally queried for
a wide range of data characteristics and discretized the performance gain of J48
over 1R into three classes: “draw”, “win_J48” (4% to 20% gain), and “large_win_J48”
(20% to 70% gain). The tree returned by J48 on this meta-dataset is shown in Fig. 6,
and shows for which kinds of datasets J48 has a clear advantage over OneR.



[Figures 5 and 6: decision tree diagrams, not reproduced. Recoverable node labels of
Fig. 5: binary splits, min_inst <= 3, min_inst <= 12.]

Fig. 5. Impact of parameter settings.

Fig. 6. Impact of dataset properties.



† To control the value of additional dataset properties, simply add these constraints to the list:
WHERE ... and d.nr_attributes>5.

4.5 On user-friendliness

The above SQL queries are relatively complicated. Part of this is however a conse- 
quence of the relatively complex structure of the database. A good user interface, 
including a graphical query tool and an integrated visualization tool, would greatly 
improve the usability of the database. 



5 Conclusions 

We have presented an experiment database for classification, providing a
well-structured repository of fully described classification experiments, thus
allowing them to be easily verified, reused and related to theoretical properties of
algorithms and datasets. We show how easy it is to investigate a wide range of questions
on the behavior of these learning algorithms by simply writing the right queries and
interpreting the results, or by applying data mining algorithms to model more
complex interactions. The database is available online and can be used to gain new
insights into classifier learning and to validate and refine existing results. We believe
this database and underlying software may become a valuable resource for research
in classification and, more broadly, machine learning and data analysis.
Abstract. Conjoint analysis is a widely used method in marketing research. Some problems
occur when conjoint analysis is used for complex services where the perception of and the
preference for attributes and levels vary considerably among individuals. Clustering and
clusterwise estimation procedures as well as Hierarchical Bayes (HB) estimation can help
to model this perceptual uncertainty and preference heterogeneity. In this paper we analyze
the advantages of clustering and clusterwise HB as well as combined estimation procedures
of collected preference data for complex services and therefore extend the analysis of Sentis
and Li (2002).



1 Introduction 

Conjoint analysis is a “... method that estimates the structure of consumer’s pref- 
erences ...” (Green and Srinivasan (1978), p. 104). Typically, hypothetical concepts 
for products or services (attribute-level-combinations) are presented to and rated by 
a sample of consumers in order to estimate part worths for attribute-levels from a 
consumer’s point of view and to develop acceptable products or services. Since its 
introduction into marketing in the early 1970s conjoint analysis has become a favored 
method within marketing research (see, e.g., Green et al. (2001)). 

Consequently, conjoint analysis is nowadays a method with a huge number of known
applications, and many specialized tools for data collection and analysis have been
developed for it. For part worth estimation, especially clusterwise
estimation procedures (see, e.g., Baier and Gaul (1999, 2003)) and Hierarchical Bayes
(HB) estimation (see, e.g., Allenby and Ginter (1995), Lenk et al. (1996)) seem to be
attractive newer developments.

After a short discussion of specific problems of conjoint analysis when applied
to preference measurement for (complex) services (section 2) we propose estimation
procedures based on HB and clustering to reduce these problems (section 3). An
empirical investigation (section 4) shows the viability of this proposition.
2 Preference measurement for services

The concepts evaluated by consumers within a conjoint study can be hypothetical
products as well as - with an increasing importance during recent years - services.
However, services place special demands on the research design due to the following
peculiarities (e.g., Zeithaml et al. (1985), pp. 33): immateriality, integration of an
external factor, non-standardization, and perishability.

The main peculiarity is that services cannot be taken into the hands - they are 
immaterial. This leads to problems during the data collection phase, where hypothet- 
ical services have to be presented to the consumer. It has been shown that the “right” 
description and presentation influences the “right” perception of consumers and con- 
sequently the validity of the estimated part worths from the collected data (see, e.g., 
Ernst and Sattler (2000), Brusch et al. (2002)). 

Furthermore, as we all know, the quality of services depends on the producing 
persons and objects as well as their interaction with persons and objects from the 
demand side - the so-called external (production) factors. Their synergy, willingness, 
and quality often cannot be evaluated before consumption. Perceptual uncertainty of 
the usefulness of different attributes and levels as well as preference heterogeneity 
is common among potential buyers. Part worth estimates for attribute levels have to 
take this into account. Intra- and inter-individual variation has to be modelled. 



3 Hierarchical Bayes procedures for conjoint analysis 

Recently, for modelling this intra- and inter-individual variation, clustering and clus- 
terwise part worth estimation as well as HB estimation have been proposed. 

Clustering and clusterwise part worth estimation provide traditional ways to 
model preference heterogeneity in conjoint analysis (see, e.g., Baier and Gaul (1999, 
2003)). The population is assumed to fall into a number of (unknown) clusters or seg- 
ments whose segment-specific part worths have to be estimated from the collected 
data. 

HB, on the other hand, estimates individual part worth distributions by
“borrowing” information from other individuals (see, e.g., Baier and Polasek (2003), where
this aspect of borrowing is described in detail for a conjoint analysis setting).
Preference heterogeneity is not assumed via introducing segments. Instead, the deviation of
the individual part worth distributions from a mean part worth distribution is derived
from the collected individual data (for methodological details and new developments
see, e.g., Allenby et al. (1995), Lenk et al. (1996), Andrews et al. (2002), Liechty et
al. (2005)).

The main advantages and therefore the reasons for the attention given to HB can be
summarized as follows (Orme (2000)):

• HB estimation seems (at least) to outperform traditional models with respect to
predictive validity.

• HB estimation seems to be robust.

• HB permits - even with little data - individual part worth estimation and
therefore allows heterogeneity across respondents to be modelled.

• HB helps to differentiate signal from noise.

• HB and its “draws” (replicates) model uncertainty and therefore provide a rich
source of information.

Given these facts and statements about HB, it is not surprising that the
impression arises that - especially in case of standard products and services - “HB
methods achieve an ‘analytical alchemy’ by producing information where there is
very little data ...” (Sentis and Li (2002), p. 167).

However, two questions are still open. Question 1: is this also true when complex
services have to be analyzed? Question 2: in this case, should a combination of HB
and clustering be used instead of either procedure alone?

Our investigation tries to close this gap: Clustering and clusterwise HB as well as
combined estimation procedures are applied to collected preference data for complex
services. The results are compared with respect to predictive validity. The
investigation extends the analysis of Sentis and Li (2002), who observed in a simpler setting
that predictive validity (hit rates) was not improved by combining clustering and
HB estimation.



4 Empirical investigation 

4.1 Research object 

For our investigation a complex service is used: a university course of study with
new e-learning features, e.g., different possibilities to join the lecture (in a lecture hall
or at home using video conferencing) or different types of scripts (printed scripts or
multimedia scripts with interactive exercises). Here, complexity is used as a term to
differentiate from simple aspects of services (e.g., price, opening hours, processing
time). Because of this complexity of the attributes and levels we expect that perceptual
uncertainty (because the attributes and levels are difficult to describe) as well as
preference heterogeneity among consumers have to be considered.

The research object has four attributes, each with three or four levels (for the
structure see Table 1). In total, 15 part worth parameters have to be estimated in our
analyses.

4.2 Research design 

A conjoint study is carried out using today’s standard tool for conjoint data
collection, Sawtooth Software’s ACA system (Sawtooth Software (2002)), to be
precise ACA/Web within SSI/Web (Windows Version 2.0.1b). For our investigation
a five-step analysis is used to answer the questions in focus.
Step 1 - Analyzing the quality

In our study we had 239 started and 213 finished questionnaires. Standard ACA
methodology was used for individual part worth estimation. Standard selection
criteria reduced the number of usable respondents to 162 with passably good R²-measures.

Step 2 - Calculating standardized part worths 

The individual part worths were standardized: the attribute level with the lowest
(worst) part worth is set to 0, and the best attribute level combination (the combination
of the best levels of each attribute) is set to 1. In the following, these standardized
individual part worths were used for clustering.
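
One common reading of this standardization can be sketched as follows, assuming that the worst level of each attribute is shifted to 0 and that all values are then scaled so that the best levels sum to 1. The function name and the raw values are hypothetical, and the sketch does not handle the degenerate case in which all levels of every attribute are equal.

```python
def standardize(part_worths):
    """part_worths: {attribute: [raw part worth per level]}."""
    # Shift each attribute so that its worst level becomes 0.
    shifted = {a: [v - min(vals) for v in vals] for a, vals in part_worths.items()}
    # Scale so that the combination of the best levels sums to 1.
    total_best = sum(max(vals) for vals in shifted.values())
    return {a: [v / total_best for v in vals] for a, vals in shifted.items()}


raw = {"attribute 1": [-1.0, 0.0, 1.0], "attribute 2": [0.5, 2.5, 1.5, 0.5]}
std = standardize(raw)
print(std)
```

After this transformation the part worths of different respondents lie on a common scale, which is what makes distance-based clustering of them meaningful.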

Step 3 - Clustering 

The sample was divided into two segments (cluster 1 and cluster 2) by means of a 
cluster analysis and an elbow criterion. The cluster analysis uses Euclidean distances 
and Ward’s method and is based on the standardized individual part worths. From the 
resulting dendrogram it could be seen that cluster 2 is far more heterogeneous than
cluster 1.

As shown in Table 1 two clusters with a few differences were found. For example, 
the different order of the relative importance of the attributes is noticeable. For cluster 
1 the most relevant (important) attribute is attribute 3. For cluster 2 - where the 
relative importance of the attributes is more uniformly distributed - the most relevant 
attribute is instead attribute 2. 

Step 4 - Computing HB utilities 

The distributions of individual part worths were computed via aggregated HB as well
as via two clusterwise HB part worth estimations. For our analysis, the software
ACA/HB from Sawtooth Software, Inc. is used (Sawtooth Software (2006)), currently
the most relevant standard tool for conjoint data analysis. Preprocessing in order to
segment the available individual data was done via SAS. The following parameters
are set:

• 5,000 iterations before using results (burn in), 

• 10,000 draws to be used for each respondent, 

• no constraints in use, 

• fitting pairs & priors, and

• saving random draws. 

Thus, 10,000 draws from the individual part worth distribution are available for 
each respondent from aggregated HB as well as two clusterwise HB estimations 
resulting in three samples (total sample, cluster 1, cluster 2). These HB utilities will 
be used to answer our research questions. 
Table 1. Conjoint results for the total sample and the clusters

                         Total sample (n=162)  Cluster 1 (n=80)      Cluster 2 (n=82)
                         Rel. Imp.  PW         Rel. Imp.  PW         Rel. Imp.  PW
Attribute 1              18.2 %                14.4 %                21.8 %
  Level 1                           0.032                 0.044                 0.020
  Level 2                           0.117                 0.080                 0.154
  Level 3                           0.140                 0.095                 0.184
Attribute 2              27.5 %                28.5 %                26.5 %
  Level 1                           0.106                 0.177                 0.036
  Level 2                           0.157                 0.183                 0.132
  Level 3                           0.238                 0.236                 0.240
  Level 4                           0.081                 0.036                 0.124
Attribute 3              29.2 %                33.1 %                25.4 %
  Level 1                           0.256                 0.317                 0.196
  Level 2                           0.145                 0.131                 0.159
  Level 3                           0.163                 0.165                 0.161
  Level 4                           0.019                 0.017                 0.021
Attribute 4              25.1 %                24.0 %                26.2 %
  Level 1                           0.199                 0.218                 0.181
  Level 2                           0.122                 0.068                 0.174
  Level 3                           0.147                 0.097                 0.195
  Level 4                           0.043                 0.050                 0.035

Rel. Imp. ... relative importance, PW ... part worths



Step 5 - Calculating values for predictive validity 

Predictive validity was assessed by integrating a specific holdout task into the
questionnaire. This task included the evaluation of five service
concepts, similar to the “calibration concepts” of a usual ACA questionnaire. The
respondents were asked for the “likelihood of using”. This holdout task was separated
from the conjoint task of the ACA questionnaire.

Predictive validity will be measured using two values: the Spearman rank-order 
correlation coefficient and the first-choice-hit-rate. The Spearman rank-order corre- 
lation compares the predicted preference values with the corresponding observed or- 
dinal scale response data from the holdout task. The first-choice-hit-rate is the share 
of respondents where the stimulus with the highest predicted preference value is also 
the one with the highest observed preference value. 
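
Both validity measures can be sketched in a few lines for a single respondent. The code below is an illustration only: tied values receive average ranks, a tie for the highest observed rating counts as a hit (an assumption), and the example numbers are made up. The Spearman function is undefined for constant input.

```python
def ranks(values):
    """Average ranks (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def first_choice_hit(predicted, observed):
    """True if the stimulus with the highest predicted value is rated highest."""
    best = max(range(len(predicted)), key=lambda i: predicted[i])
    return observed[best] == max(observed)


predicted = [0.2, 0.9, 0.5, 0.1, 0.3]   # predicted preference values
observed = [20, 80, 60, 10, 30]         # holdout ratings ("likelihood of using")
print(spearman(predicted, observed), first_choice_hit(predicted, observed))
```

The first-choice-hit-rate of a sample is then the share of respondents for whom `first_choice_hit` is true, and the mean Spearman value is the average of the individual correlations.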

4.3 Results 

The results of our investigation are shown in Tables 2 and 3. Table 2 shows the va- 
lidity values for the traditional ACA estimation for each partial sample. The validity 
values are based on the averages of the traditional (standardized) ACA part worths. 

As you can see in Table 2 the validity values for cluster 1 are higher than the 
values for the total sample. Cluster 2 has instead the lowest (worst) validity results. 
Table 2. Validity values for the total sample and for the clusters for traditional ACA estimation
(using standardized part worths from step 2 at the individual level)

                                                Total sample   Cluster 1      Cluster 2
                                                (n=161)*       (n=79)*        (n=82)
First-choice-hit-rate (using individual data)   62.11 %        73.42 %        51.22 %
Mean Spearman (using individual data)           0.735          0.782          0.689

* ... one respondent had missing holdout data and could not be considered



The validity values shown in Table 3 are based on the HB estimation and are
given for the total sample and for the two clusterwise estimations. The results for the
clusters are distinguished by the estimation basis (total sample or segment). The
description “in total sample” means that the HB utilities of the respondents were
computed by “borrowing” information from the total sample (not only from members
of their own segment). Thus, the HB estimation was carried out for all respondents
together, but the validity values for the two clusters were computed separately afterwards.
On the other hand, the description “in segment” means that the HB utilities of the
respondents were computed by “borrowing” information only from members of their
own segment (clusterwise HB estimation).

Furthermore, the results in Table 3 are distinguished according to the data basis. 
The validity values are shown for the computation based on the 10,000 draws (10,000 
HB utilities) for each respondent and for the computation based on the mean HB 
utilities (one HB utility as mean of 10,000 draws (iterations)) for each individual. 
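
The difference between the two data bases can be illustrated for one respondent with a first-choice hit. This is a sketch under assumptions: an additive utility model, concepts given as tuples of level indices, and three made-up draws instead of 10,000; it is not claimed to reproduce the exact computation of the study.

```python
def utility(part_worths, concept):
    """Additive utility: sum of the part worths of the levels in a concept."""
    return sum(part_worths[level] for level in concept)


def hit(part_worths, concepts, observed):
    """True if the concept with the highest utility is also rated highest."""
    scores = [utility(part_worths, c) for c in concepts]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return observed[best] == max(observed)


def hit_rate_using_draws(draws, concepts, observed):
    """Evaluate every draw separately and average the hits."""
    return sum(hit(d, concepts, observed) for d in draws) / len(draws)


def hit_using_mean_draw(draws, concepts, observed):
    """Average the draws first, then evaluate the single mean utility vector."""
    n_levels = len(draws[0])
    mean = [sum(d[k] for d in draws) / len(draws) for k in range(n_levels)]
    return hit(mean, concepts, observed)


draws = [[0.1, 0.9, 0.6, 0.2], [0.8, 0.3, 0.3, 0.3], [0.2, 0.7, 0.5, 0.4]]
concepts = [(0, 2), (1, 3)]   # each concept is a tuple of level indices
observed = [40, 70]           # holdout ratings of the two concepts

print(hit_rate_using_draws(draws, concepts, observed))
print(hit_using_mean_draw(draws, concepts, observed))
```

In this toy case one of the three draws predicts the wrong first choice, while the mean of the draws predicts the right one, which shows how averaging can smooth out the noise of individual draws.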

From Table 3 it can be seen that the validity values in cluster 1 are higher and
in cluster 2 lower than in the total sample. Further differences between the clusters
can be found when looking at the HB estimation basis (joint estimation in the total
sample (“in total sample”) or clusterwise estimation (“in segment”)). For cluster
1 the results of a joint estimation are better in most cases than those of a clusterwise
estimation, whereas the opposite can be seen in cluster 2.

When comparing the results of Table 3 with those of Table 2 it can be seen that
all validity values for the HB estimation based on individual averages are higher than
for the ACA estimation, regardless of which HB estimation basis (“in total sample” or
“in segment”) is used. In the case of HB estimation using individual draws, a mixed
result with respect to validity can be found.



Table 3. Validity values for the total sample and for the clusters for HB estimation (“in
total sample”: HB estimation at the individual total sample level; “in segment”: separate HB
estimation at the individual cluster 1 resp. 2 level)

                                                Total sample   Cluster 1 (n=79)*              Cluster 2 (n=82)
                                                (n=161)*       In total sample  In segment    In total sample  In segment
First-choice-hit-rate (using draws, n=10,000)   62.57 %        72.38 %          72.39 %       53.12 %          53.14 %
Mean Spearman (using draws, n=10,000)           0.727          0.780            0.778         0.677            0.671
First-choice-hit-rate (using mean draws)        65.22 %        75.95 %          74.68 %       54.88 %          57.32 %
Mean Spearman (using mean draws)                0.748          0.802            0.797         0.696            0.700

* ... one respondent had missing holdout data and could not be considered



5 Conclusion and outlook

The questions in focus of our investigation can be answered. The first question was
whether HB estimation can produce “better results” than traditional part worth
estimation when complex services have to be analyzed. This can be affirmed for the
usage of individual means, regardless of whether the total sample or the segments are
considered. Furthermore, we were interested in whether clusterwise estimation can
improve the “results” of HB estimation. A clear answer is not yet possible. In
our empirical investigation the validity values improved in some cases
(cluster 2) but not in others (cluster 1).

This means that our proposition in the paper can help to reduce the problems that 
occur when service preference measurement via conjoint analysis is the research 
focus. HB estimation seems to improve validity even in case of complex services 
with immaterial attributes and levels that cause perceptual uncertainty and preference 
heterogeneity. However, going further with the more complicated way of performing
clusterwise HB estimation does not automatically provide better results.

Nevertheless, further comparisons with larger sample sizes and other research
objects are necessary. Furthermore, other validity criteria could be used to allow
clearer statements.
Abstract. The discovery of association rules is a popular approach to detect cross-category
purchase correlations hidden in large amounts of transaction data and extensive retail
assortments. Traditionally, such item or category associations are studied on an ’average’ view of the
market and do not reflect heterogeneity across customers. With the advent of loyalty programs,
however, tracking each program member’s transactions has become easier, enabling
retailers to customize their direct marketing efforts more effectively by utilizing cross-category
purchase dependencies at a more disaggregate level. In this paper, we present the building
blocks of an analytical framework that allows retailers to derive customer segment-specific
associations among categories for subsequent target marketing. The proposed procedure starts
with a segmentation of customers based on their transaction histories using a constrained
version of K-centroids clustering. In a second step, associations are generated separately for each
segment. Finally, methods for grouping and sorting the identified associations are provided.
The approach is demonstrated with data from a grocery retailing loyalty program.



1 Introduction 

One central goal of customer relationship management (CRM) is to target customers
with offers that best match their individual consumption needs. Thus, the question
of who to target with which range of products or items emerges. Most previous
research in CRM or direct marketing concentrates on the issue of who to target (for an
extensive literature review see, e.g., Prinzie and Van den Poel (2005)). We address
both parts of this question and introduce the cornerstones of an analytical framework
for customizing direct marketing campaigns at the customer segment level.

In order to identify and to make use of possible cross-selling potentials, the pro- 
posed approach builds on techniques for exploratory analysis of market basket data. 
Retail managers have been interested in better understanding the purchase interde- 
pendency structure among categories for quite a while. One obvious reason is that 
knowledge about correlated demand patterns across several product categories can 
be exploited to foster cross-buying effects using suitable marketing actions. For ex- 
ample, if customers often buy a particular product A together with article B, it could 
be useful to promote A in order to boost sales volumes of B, and vice versa. The
objective of exploratory market basket analysis is to discover such unknown cross-item
correlations from a typically huge collection of purchase transaction data (so-called 
market baskets) accruing at the retailer’s point-of-sale scanning devices (Berry and 
Linoff (2006)). Among others, algorithms for mining association rules are popular 
techniques to accomplish this task (cf., e.g. Hahsler et al. (2006)). However, such 
association rules are typically derived for the entire data set of available retail trans- 
actions and thus reflect an ’average’ or aggregate view of the market only. 

In recent years, many retailers have tried to improve their CRM activities by 
launching loyalty programs, which provide their members with bar-coded plastic 
or registered credit cards. If customers use these cards during their payment process, 
they get a bonus, credits or other rewards. As a side effect, these transactions become 
personally identifiable by linking them back to the corresponding customers. Thus, 
retailers are nowadays collecting series of market baskets that represent (more or 
less) complete buying histories of their primary clientele over time. 



2 A segment-specific view of cross-category associations 

To exploit the potential benefits offered by such rich information on customers’ pur- 
chasing behavior within advanced CRM programs, cross-category correlations need 
to be detected on a more disaggregate (or customer segment) level instead of an 
aggregate level. Attempts towards this direction are made by Boztug and Reutterer 
(2007) or Reutterer et al. (2006). The authors employ vector quantization techniques 
to arrive at a set of ’generic’ (i.e., customer-unspecific) market basket classes with 
internally more distinctive cross-category interdependencies. In a second step they 
generate a segmentation of households based on a majority voting of each house- 
hold’s basket class assignments throughout the individual purchase history. These 
segments are proposed as a basis for designing customized target marketing actions. 

In contrast to these approaches, the procedure presented below adopts a novel 
centroids-based clustering algorithm proposed by Leisch and Grün (2006), which
bypasses the majority voting step for segment formation. This is achieved by a cross- 
category effects sensitive partitioning of the set of (non-anonymous) market basket 
data, which imposes group constraints determined by the household labels associated 
with each of the market baskets. Hence, during the iterative clustering process the 
single transactions are "forced" to keep linked with all the other transactions of a 
specific household’s buying history. This results in segments whose members can be 
characterized by distinctive patterns of cross-category purchase interrelationships. 

To get a better feeling of the inter-category purchase correlations within the pre- 
viously identified segments, association rules derived separately for each segment 
and evaluated by calculating various measures of significance and interestingness 
can assist marketing managers for further decision making on targeted marketing ac- 
tions. Although the within-segment cross-category associations are expected to differ 
significantly from those generated for the unsegmented data set (because of the data 
compression step employed prior to the analysis), low minimum thresholds of such 
measures typically still result in a huge number of potentially interesting associations.
To arrive at a clearer and managerially more traceable overview of the various
segment-specific cross-category purchase correlations, we arrange them based on a 
distance concept suggested by Gupta et al. (1999). 

The next section characterizes the building blocks of the employed methodology 
in more detail. Section 4 empirically illustrates the proposed approach using a trans- 
action data set from a grocery retailing loyalty program and presents selected results. 
Section 5 closes the article with a summary and an outlook on future research. 



3 Methodology 

The conceptual framework of the proposed approach is depicted in Figure 1 and consists
of three basic steps: First, a modified K-centroids cluster algorithm partitions
the entire transaction data set and defines K segments of households with an inter- 
est in similar category combinations. Secondly, the well-known APRIORI algorithm 
(Agrawal et al. (1993)) searches within each segment for specific frequent itemsets, 
which are filtered by a suitable measure of interestingness. Finally, the associations 
are grouped via hierarchical clustering using a distance measure for associations. 



[Figure 1 (flow diagram): In Step 1, the K-centroids cluster algorithm partitions the transaction data $X_N$ while holding the linkage to $I_p$. In Step 2, associations are mined separately within each segment k = 1, 2, ..., K. In Step 3, the mined associations are filtered, grouped and sorted within each segment.]

Fig. 1. Conceptual framework of the proposed procedure



Step 1: Each transaction or market basket can be interpreted as a J-dimensional
binary vector $x_n \in \{0,1\}^J$ with $j = 1, 2, \ldots, J$ categories. A value of one refers to the
presence and a zero to the absence of an item in the market basket. Integrated into a
binary matrix $X_N$, the rows correspond to transactions while each column represents
an item. Let the set $I_p$ describe a group constraint indicating the buying history of
customer $p = 1, 2, \ldots, P$ with $\{x_i \in X_N \mid i \in I_p\}$. The objective function for a modified
K-centroids clustering respecting group constraints is (Leisch and Grün (2006)):

$$D(X_N, C_K) = \sum_{p=1}^{P} \sum_{i \in I_p} d(x_i, c(I_p)) \rightarrow \min \qquad (1)$$

An iterative algorithm for solving Equation 1 requires calculation of the closest
centroid $c(\cdot)$ for each transaction $x_i$ according to the distance measure $d(\cdot)$ at each
iteration. To cope with the usually sparse binary transaction data and to make the
partition sensitive to cross-category effects, the Jaccard coefficient, which gives more
weight to the co-occurrences of ones than to common zeros, is used as an appropriate
distance measure (cf. Decker (2005)). Notice that in contrast to methods like the
K-means algorithm, groups of market baskets as given by $I_p$ (i.e., customer p’s
complete buying history), rather than single transactions, need to be assigned to a
minimum-distance centroid. This is ensured by a function $f(x_i)$ that determines the centroid
closest to the majority of the grouped transactions (cf. Leisch and Grün (2006)).

In order to achieve directly accessible and more intuitively interpretable results, 
we can calculate cluster-wise means for updating the prototype system instead of 
optimized canonical binary centroids. This results in an ’expectation-based’ cluster- 
ing solution (cf. Leisch (2006)), whose centroids are equivalent to segment-specific 
choice probabilities of the corresponding categories. Notice that the segmentation 
of households is determined such that each customer’s complete purchase history 
points exclusively to one segment. Thus, in the present application context the set 
of K centroids can be interpreted as prototypical market baskets that summarize the 
most pronounced item combinations demanded by the respective segment members 
throughout their purchase history. An illustrative example is provided in Table 1 of 
the subsequent empirical study. 
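
The assignment and update rules of Step 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Leisch and Grün (2006): the function names `jaccard_distance` and `grouped_kcentroids`, the random initialization, the stopping rule, and the use of the extended Jaccard (Tanimoto) form for real-valued mean centroids are all choices made here for illustration. For binary vectors the extended form reduces to the usual Jaccard distance.

```python
import random
from collections import Counter, defaultdict

def jaccard_distance(x, c):
    """Extended Jaccard (Tanimoto) distance; assumption: used for mean centroids."""
    dot = sum(a * b for a, b in zip(x, c))
    denom = sum(a * a for a in x) + sum(b * b for b in c) - dot
    return 1.0 - dot / denom if denom > 0 else 1.0

def grouped_kcentroids(baskets, households, k, n_iter=50, seed=0):
    """Assign each household (with all of its baskets) to exactly one segment."""
    rng = random.Random(seed)
    centroids = [list(map(float, b)) for b in rng.sample(baskets, k)]
    history = defaultdict(list)  # I_p: basket indices of household p
    for i, p in enumerate(households):
        history[p].append(i)
    segment = {}
    for _ in range(n_iter):
        new_segment = {}
        for p, idx in history.items():
            # Each basket votes for its closest centroid; the majority wins.
            votes = Counter(
                min(range(k), key=lambda j: jaccard_distance(baskets[i], centroids[j]))
                for i in idx
            )
            new_segment[p] = votes.most_common(1)[0][0]
        if new_segment == segment:
            break
        segment = new_segment
        for j in range(k):
            # Cluster-wise means give the 'expectation-based' centroids.
            members = [baskets[i] for i, p in enumerate(households) if segment[p] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return segment, centroids

# Hypothetical toy data: household "a" buys categories 0 and 1, "b" buys 2 and 3.
baskets = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]
households = ["a", "a", "b", "b", "a"]
segment, centroids = grouped_kcentroids(baskets, households, k=2)
```

Because the vote is taken per household, a household's complete buying history always points to a single segment, and each centroid row can be read as segment-specific purchase shares of the categories.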

Step 2: The centroids derived in the segmentation step already provide some 
indications on the general structure of the cross-item interdependencies within the 
household segments. To get a more thorough understanding, interesting category 
combinations (so-called itemsets) can be further explored by the APRIORI algorithm
using a user-defined support value. For the entire data set, the support of an arbitrary
itemset A is denoted by $\mathrm{supp}(A) = |\{x_n \in X_N \mid A \subseteq x_n\}| / |X_N|$ and defines the fraction
of transactions containing itemset A. Notice that in the present context, however, 
itemsets are generated at the level of previously constructed segments. 

The itemsets are called frequent if their support is above a user-defined thresh- 
old value, which implies their sufficient statistical importance for the analyst. To 
generate a wide range of associations, rather low minimum support values are usu- 
ally preferred. Because not all associations are equally meaningful, an additional 
measure of interestingness is required to filter the itemsets for evaluation purposes. 
Since our focus is on itemsets, asymmetric measures like confidence or lift are less 
useful (cf. Hahsler (2006)). We advocate here the so-called all-confidence measure 
introduced by Omiecinski (2003), which is the minimum confidence value for all 
rules that can be generated from the underlying itemset. Formally it is denoted by 
$\mathrm{allconf}(A) = \mathrm{supp}(A) / \max_{B \subset A}\{\mathrm{supp}(B)\}$ for all frequent subsets B with $B \subset A$.
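
As a small worked example of the two measures used in Step 2, the following Python sketch computes support and all-confidence by brute force. The function names and the toy transactions are hypothetical and serve illustration only; a real application would rely on an APRIORI implementation for itemset generation. Since support can only decrease when items are added, the maximum over all proper subsets is attained by a single item, which the code exploits.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def all_confidence(itemset, transactions):
    """supp(A) divided by the largest support among the proper subsets of A."""
    max_item_support = max(support({item}, transactions) for item in itemset)
    return support(itemset, transactions) / max_item_support

# Hypothetical toy baskets
transactions = [
    frozenset({"brandy", "whisky"}),
    frozenset({"brandy"}),
    frozenset({"whisky", "wine"}),
    frozenset({"brandy", "whisky", "wine"}),
]
pair = {"brandy", "whisky"}
pair_support = support(pair, transactions)         # 2 of 4 baskets
pair_allconf = all_confidence(pair, transactions)  # 0.5 / 0.75
```

Unlike confidence or lift, the value does not depend on which item is treated as the antecedent, which is why it suits the evaluation of itemsets rather than rules.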

Step 3: Although the all-confidence measure can assist in reducing the number of 
itemsets considerably, in practice it can still be difficult to handle several hundreds of 
remaining associations. For an easier recognition of characteristic inter-item corre- 
lations within each segment, the associations can be grouped based on the following 
Jaccard-like distance measure for itemsets (Gupta et al. (1999)): 

$$D(A,B) = 1 - \frac{|m(A \cup B)|}{|m(A)| + |m(B)| - |m(A \cup B)|} \qquad (2)$$

Expression $m(\cdot)$ denotes the set of transactions containing the itemset. From
Equation 2 it should be evident that the distance between two itemsets tends to be 
lower if the involved itemsets occur in many common transactions. This property 
qualifies the measure to determine specific groups of itemsets that share some com- 
mon aspects of consumption behavior (cf. Gupta et al. (1999)). 
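
The distance in Equation 2 can be computed directly from the sets of covering transactions. The sketch below is illustrative only; the function names and toy baskets are assumptions. It uses the fact that the transactions containing the joint itemset A ∪ B are exactly those containing both A and B, that is, the intersection of m(A) and m(B).

```python
def covering(itemset, transactions):
    """m(.): indices of the transactions that contain the itemset."""
    itemset = frozenset(itemset)
    return {i for i, t in enumerate(transactions) if itemset <= t}

def itemset_distance(a, b, transactions):
    """Jaccard-like distance between two itemsets following Equation 2."""
    m_a, m_b = covering(a, transactions), covering(b, transactions)
    m_ab = m_a & m_b  # transactions containing A and B jointly
    denom = len(m_a) + len(m_b) - len(m_ab)
    return 1.0 - len(m_ab) / denom if denom else 1.0

# Hypothetical toy baskets
transactions = [
    frozenset({"brandy", "whisky"}),
    frozenset({"brandy"}),
    frozenset({"whisky", "wine"}),
    frozenset({"brandy", "whisky", "wine"}),
]
d_brandy_whisky = itemset_distance({"brandy"}, {"whisky"}, transactions)
d_brandy_wine = itemset_distance({"brandy"}, {"wine"}, transactions)
```

Evaluating the function for all pairs of retained itemsets yields the symmetric distance matrix that serves as input to the grouping step.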



4 Empirical application 

The following empirical study illustrates some of the results obtained from the procedure
described above. We analyzed two samples of real-world transaction data, each
realized by 3,000 members of a retailer’s loyalty program. The customers made on 
average 26 shopping trips over an observational period of one year. Each transaction 
contains 268 binary variables, which represent the category range of the assortment. 

To achieve managerially meaningful results, preliminary screening of the data 
suggested the following adjustments of the raw data: 

1. The purchase frequencies are clearly dominated by a small range of categories, 
such as fresh milk, vegetables or water (see Figure 2). Since these categories 
are bought several times by almost every customer during the year under investigation,
they provide relatively little information on the differentiated buying
habits of the customers. The opposite is supposed to be true for categories with 
intermediate or lower purchase frequencies. Therefore, we decided to eliminate 
the upper 52 categories (left side of the vertical line in Figure 2), which occur 
in more than 10% of all transactions. The resulting empty baskets are excluded 
from the analysis as well. 



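[Figure 2 (bar chart): relative purchase frequency per category on the vertical axis, ranging from 0.0 to 0.5; categories sorted in decreasing order of frequency.]

Fig. 2. Distribution of relative category purchase frequencies in decreasing order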



2. To include households with sufficiently large buying histories, households with 
fewer than six store visits per year were eliminated. In addition, the top five
percent of households, which use their customer cards extremely
often, were deleted.
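
The two screening rules can be expressed as a short filter. The following Python sketch is a hedged illustration: the function name, the parameter names, and the order of operations (visits are counted after empty baskets have been removed, and the most active households are cut by rank) are assumptions, since the text does not specify these details.

```python
from collections import Counter

def screen_transactions(baskets, households, max_share=0.10, min_visits=6, top_share=0.05):
    """Drop very frequent categories, empty baskets, and atypical households."""
    n, n_cat = len(baskets), len(baskets[0])
    # Rule 1: keep only categories occurring in at most max_share of all baskets.
    share = [sum(b[j] for b in baskets) / n for j in range(n_cat)]
    keep = [j for j in range(n_cat) if share[j] <= max_share]
    reduced = [([b[j] for j in keep], p) for b, p in zip(baskets, households)]
    reduced = [(b, p) for b, p in reduced if any(b)]  # remove empty baskets
    # Rule 2: drop households with too few visits and the most active top share.
    visits = Counter(p for _, p in reduced)
    ranked = sorted(visits, key=visits.get, reverse=True)
    heavy = set(ranked[: int(len(ranked) * top_share)])
    valid = {p for p in visits if visits[p] >= min_visits and p not in heavy}
    return [(b, p) for b, p in reduced if p in valid], keep

# Hypothetical toy data with relaxed thresholds so that every rule is triggered.
baskets = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 1],
           [1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 1], [1, 0, 1]]
households = ["h1", "h2", "h2", "h3", "h3", "h4", "h4", "h4", "h4", "h4"]
rows, kept_categories = screen_transactions(
    baskets, households, max_share=0.5, min_visits=2, top_share=0.25
)
```

In the toy data the first category appears in every basket and is removed, household h1 has too few visits, and h4 is dropped as the most active household.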

To find a sufficiently stable cluster solution with a minimum within-sum of distances,
the transactions made by the households from the first sample are split into
three equal subsamples and clustered up to fifteen times each. In each case, the
best solution is kept for the following subsample to achieve stable results. The converged
set of centroids of the third subsample is used for initialization of the second
sample. Commonly used techniques for determining the number of clusters recommended
K = 11 clusters as a decent and well-manageable number of household
segments. Given these specifications, the partitioning of the second sample using the 
proposed cluster algorithm detects some segments, which are dominated by category 
combinations typically bought for specific consumption or usage purposes and other 
types of categorical similarities. For example, Table 1 shows an extract of a centroid
vector including the top six categories in terms of highest conditional purchase prob- 
abilities in a segment of households denoted as the "wine segment". A typical market 
basket arising from this segment is expected to contain red/rose wines with a proba- 
bility of 32.3 %, white wines with a probability of 22.5 %, etc. Hence, the labeling 
"wine segment". 

Equally, other segments may be characterized by categories like baby food/care 
or organic products. On the other hand, there is also a small number of segments with 
category interrelationships that cannot be easily explained. However, such segments 
might provide some interesting insights into the interests of households which are so 
far unknown. 



Table 1. Six categories with highest purchase frequencies in the wine segment

No.  Category           Purchase frequency
1.   red / rose wines   0.3229143
2.   white wines        0.2252356
3.   sparkling wine     0.1225006
4.   condensed milk     0.1206619
5.   appetizers         0.1080211
6.   cooking oil        0.1066422



According to the second step of the proposed framework, frequent itemsets are 
generated from the transactions within the segments. Since we want to mine a wide 
range of associations, a quite low minimum support threshold is chosen (e.g., supp = 
1%). In addition, all frequent itemsets are required to include at least two categories.
Taking this into account, the APRIORI algorithm finds 704 frequent itemsets for the 
transactions of the wine segment. To reduce the number of associations and to focus 
on the most interesting frequent itemsets, only the 150 itemsets with highest all- 
confidence values are considered for grouping according to step 3 of the procedure. 

Grouping the frequent itemsets intends to rearrange the order of the generated 
(segment-specific) associations and to focus the view of the decision maker on char- 
acteristic item correlations. The distance matrix derived by Equation 2 is used as 
input for hierarchical clustering according to the Ward algorithm. Figure 3 shows the
dendrogram for the 150 frequent itemsets within the wine cluster. Again, it is not
straightforward to determine the correct number of groups $g_h$. Frequently proposed
heuristics based on plotted heterogeneity measures do not help here. Therefore, we
pass the distance matrix to the partitioning around medoids (PAM) algorithm of Kaufman
and Rousseeuw (2005) for several $g_h$ values. Using the maximum value of the
average silhouette width for a sequence of partitions, thirty groups of itemsets are
proposed. In Figure 3 the grey rectangles mark two exemplarily chosen clusters of associations.
The corresponding associations of the right-hand group are summarized
in Table 2 and clearly indicate an interest of some of the wine households in hard 
alcoholic beverages. 




Fig. 3. Dendrogram of 150 frequent itemsets mined from transactions of the wine segment 



Table 2. Associations of hard alcoholic beverages within the wine segment

No.  Association                  Support  All-confidence
1.   {brandy, whisky}             0.011    0.23
2.   {brandy, fruit brandy}       0.015    0.18
3.   {fruit brandy, appetizers}   0.018    0.17
4.   {brandy, appetizers}         0.016    0.15
5.   {whisky, fruit brandy}       0.011    0.14


To examine whether the segment-specific associations differ from those gener- 
ated within the whole data set, we have drawn and analyzed random samples with 
the same amount of transactions as each of the segments. The comparison of the 
frequent itemsets mined in the random sample and those from the segment-specific 
transactions shows that some segment-specific association groups clearly represent 
a unique characteristic of their underlying household segment. Of course, this is not
true in every case. For example, the association group marked by the grey rectangle
on the left-hand side in Figure 3 can be found in almost every random sample or 
segment. It denotes correlations between categories of hygiene products. 



5 Conclusion and future work

We presented an approach for identification of household segments with distinc- 
tive patterns and subgroups of cross-category associations, which differ from those 
mined in the entire data set. The proposed framework enables retailers to segment 
their customers according to their past interest in specific item combinations. The 
mined segment-specific associations provide a good basis for deriving more respon- 
sive recommendations or designing special offers through target marketing activities. 

Nevertheless, the stepwise procedure has its natural limitations imposed by the
fact that later steps are dependent on the outcome of former stages. A simultaneous 
approach would disburden decision makers from determining various model param- 
eters (like support thresholds, number of segments) at each stage. Another drawback 
is the ad-hoc exclusion of very frequently purchased categories, which could be sub- 
stituted in future applications by a data driven weighting scheme. 



Abstract. The Analytic Hierarchy Process (AHP) has been of substantial impact in business
research and particularly in managerial decision making for a long time. Although empirical 
investigations (e.g. Scholl et al. (2005)) and simulation studies (e.g. Scholz et al. (2006)) have 
shown its general potential in consumer preference measurement, AHP is still rather unpopular 
in marketing research. 

In this paper, we compare a new online version of AHP with Adaptive Conjoint Analysis 
(ACA) on the basis of a comprehensive empirical study in tourism which includes 10 attributes 
and 35 attribute levels. We particularly focus on the convergent and the predictive validity 
of AHP and ACA. Though both methods clearly differ regarding their basic conception, the 
resulting preference structures prove to be similar on the aggregate level. On the individual 
level, however, the AHP approach results in a significantly higher accuracy with respect to 
choice prediction. 



1 Preference measurement for complex products 

Conjoint Analysis (CA) is one of the most prominent tools in consumer preference 
measurement and widely used in marketing practice. However, an often stated prob- 
lem of full-profile CA is that of dealing with large numbers of attributes. This limi- 
tation is of great practical relevance because ideally all attributes and attribute levels 
that affect individual choice should be included to map a realistic choice process. 

Various methods have been suggested to provide more accurate insights into con- 
sumer preferences for complex products with many attributes (Green and Srinivasan 
(1990)). Self-Explicated (SE) approaches, e.g., are used to minimize the information 
overload by questioning the respondents about each attribute separately. But SE has 
been criticized for lacking the trade-off perspective underlying CA. For this reason,
hybrid methods combining the strengths of SE and full-profile CA have been de- 
veloped. Sawtooth Software’s ACA is a commercially successful computer-based 
tool facilitating efficient preference measurements for complex products (for de- 
tails, please see Sawtooth Software (2003)). While several other approaches, such as 
the hierarchical Bayes extensions of Choice-Based Conjoint Analysis, are available 
for estimating part-worth utilities on the individual level, ACA is still the standard 
in preference measurement for products with more than six attributes (Hauser and
Toubia (2002), Herrmann et al. (2005)) and widely used in marketing practice (Saw- 
tooth Software (2005)). In this paper, ACA will set a common benchmark for our 
empirical comparison. 

Against this background, we introduce an online version of AHP as an alterna- 
tive tool for consumer preference measurement in respective settings. Initially, AHP 
has been developed to analyze complex decision problems by decomposing them 
hierarchically into better manageable sub-problems. It has been of substantial im- 
pact in business research and particularly in managerial decision making for a long 
time. Empirical investigations (e.g. Scholl et al. (2005)) and simulation studies (e.g. 
Scholz et al. (2006)) recently demonstrated its general potential in consumer pref- 
erence measurement. However, to the best of our knowledge, AHP has never been 
tested in a real-world online consumer survey, even though internet-based surveying
is gaining increasing importance (Fricker et al. (2005)). In this paper, we compare an
online version of AHP with ACA by referring to a comprehensive empirical investi- 
gation in tourism which includes 10 attributes and 35 attribute levels. 

The remainder of the paper is structured as follows: In Section 2, we briefly 
outline the methodological basis of AHP. Section 3 describes the design of the em- 
pirical study. The results are presented in Section 4 and we conclude with some final 
remarks in Section 5. 



2 The Analytic Hierarchy Process - AHP 

In AHP, a decision problem, e.g. determining the individually most preferred alterna- 
tive from a given set of products, has to be arranged in a hierarchy. It is referred to as 
the “main goal” in the following and represented by the top level of the hierarchy. By
decomposing the main goal into several sub-problems, each of them representing the
relation of a second-level attribute category with the main goal, the complexity of the
overall decision problem is reduced. The individual attribute categories, in turn,
are broken down into attributes and attribute levels defining “lower” sub-problems.
Typically, different alternatives (here: products or concepts) are considered at the
bottom level of the hierarchy. But due to the large number of hypothetical products,
or rather “stimuli” in the CA terminology, the use of incomplete hierarchies only
covering attribute levels, instead of complete stimuli at the bottom level, is advis- 
able. 

For the evaluation of summer vacation packages (the objects of investigation in
our empirical study), we have structured the decision problem in a 4-level hierarchy.
The hierarchical structure displayed in Table 1 reflects the respondents’ average
perceptions and decomposes the complex product evaluation problem into
easy-to-conceive sub-problems.

First, the respondents have to judge all pairs of attribute levels of each sub- 
problem on the bottom level of the hierarchy. Then, they proceed with paired comparisons
on the next higher level of the hierarchy, and so on. In this way, the respondents
are first introduced to the attributes’ range and levels. 

Table 1. Hierarchical structuring of the vacation package evaluation problem

Attribute category   Attribute            Attribute levels
Vacation spot        Sightseeing offers   1) Many 2) Some 3) Few
                     Security concerns    1) Very high 2) High 3) Average
                     Climate              1) Subtropical 2) Mediterranean 3) Desert
                     Beach                1) Lava sand 2) Sea sand 3) Shingle
Hotel services       Leisure activities   1) Fitness room 2) Lawn sport facilities
                                          3) Aquatic sports facilities
                                          4) Indoor swimming pool 5) Sauna
                                          6) Massage parlor
                     Furnishing           1) Air conditioning 2) In-room safe
                                          3) Cable/satellite TV 4) Balcony
                     Catering             1) Self-catering 2) Breakfast only
                                          3) Half board 4) Full board 5) All-inclusive
Hotel facilities     Location             1) Near beach 2) Near town
                     Type of building     1) Rooming house 2) Hotel complex
                                          3) Bungalow
                     Outside facilities   1) Several pools 2) One large pool
                                          3) One small pool



In order to completely evaluate a sub-problem h with $n^h$ elements, $n^h(n^h-1)/2$ pairwise
comparisons have to be carried out. Intuitively, the hierarchical decomposition
of complex decision problems into many small sub-problems reduces the number of
paired comparisons that have to be conducted to evaluate the decision problem.
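
The saving can be made concrete with the hierarchy of Table 1. The Python sketch below counts n(n-1)/2 comparisons for every sub-problem with n elements, under the assumption that each sub-problem is evaluated completely, and contrasts the total with a purely hypothetical flat design in which all 35 attribute levels would be compared with one another. The dictionary layout and function names are illustrative choices.

```python
def n_comparisons(n):
    """Number of pairwise comparisons needed for n elements."""
    return n * (n - 1) // 2

# Number of attribute levels per attribute, grouped by category (from Table 1)
hierarchy = {
    "Vacation spot": {"Sightseeing offers": 3, "Security concerns": 3,
                      "Climate": 3, "Beach": 3},
    "Hotel services": {"Leisure activities": 6, "Furnishing": 4, "Catering": 5},
    "Hotel facilities": {"Location": 2, "Type of building": 3,
                         "Outside facilities": 3},
}

def hierarchical_total(h):
    total = n_comparisons(len(h))  # categories compared w.r.t. the main goal
    for attributes in h.values():
        total += n_comparisons(len(attributes))  # attributes within a category
        total += sum(n_comparisons(k) for k in attributes.values())  # levels
    return total

n_levels = sum(k for attrs in hierarchy.values() for k in attrs.values())
total_hierarchical = hierarchical_total(hierarchy)  # 65
total_flat = n_comparisons(n_levels)                # 595 for 35 levels
```

Under these assumptions the 4-level hierarchy requires 65 paired comparisons, compared with 595 for the hypothetical flat design.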

Each respondent has to provide two responses for each paired comparison. First, 
the respondent has to state the direction of his or her preference for element i com- 
pared to element j with respect to an element h belonging to the next higher level. 
Second, the strength of his or her preference is measured on a 9-point ratio-scale, 
where 1 means “element i and j are equal” and 9 means “element i is absolutely preferred
to element j” (or vice versa). The respondent’s verbal expressions are transformed
into priority ratios $a_{ij}^h$, where a large ratio expresses a distinct preference
of i over j in sub-problem h. The reciprocal value $a_{ji}^h = 1/a_{ij}^h$ indicates the preference
of element j over i. All pairwise comparisons of one sub-problem measured
with respect to a higher level element h are brought together in the matrix $A^h$ (Saaty
(1980)):

$$A^h = \begin{pmatrix} 1 & a_{12}^h & \cdots & a_{1n^h}^h \\ a_{21}^h & 1 & \cdots & a_{2n^h}^h \\ \vdots & \vdots & \ddots & \vdots \\ a_{n^h 1}^h & a_{n^h 2}^h & \cdots & 1 \end{pmatrix} \qquad (1)$$



Starting from these priority ratios $a_{ij}^h$, the relative utility values are calculated by
solving the following eigenvalue problem for each sub-problem h:

$$A^h w^h = \lambda_{\max}^h \, w^h \quad \forall h \qquad (2)$$

The normalized principal right eigenvector belonging to the largest eigenvalue
of matrix $A^h$ yields the vector $w^h$, which contains the relative utility values $w_i^h$ for
each element of sub-problem h.

An appealing feature of AHP is the computability of a consistency index (CI),
which describes the degree of consistency in the pairwise comparisons of a considered
sub-problem h. The CI value expresses the relative deviation of the largest
eigenvalue $\lambda_{\max}^h$ of matrix $A^h$ from the number of included elements $n^h$:

$$CI^h = \frac{\lambda_{\max}^h - n^h}{n^h - 1} \qquad (3)$$

To get a notion of the consistency of matrix $A^h$, $CI^h$ is related to the average consistency
index RI of random matrices of the same size. The resulting measure is called
the consistency ratio $CR^h$, with $CR^h = CI^h / RI$. In order to evaluate the degree of consistency
for the entire hierarchy, the arithmetic mean of all consistency ratios, ACR, can
be used (Saaty (1980)).
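
For a single sub-problem, the relative utility values and the consistency measures can be computed as in the following Python sketch. It is a minimal illustration under stated assumptions: the principal eigenvector is approximated by power iteration, the function names are hypothetical, and the RI values are the figures commonly tabulated following Saaty (1980) rather than values reported in this paper.

```python
# Average consistency index RI of random reciprocal matrices by matrix size,
# as commonly tabulated following Saaty (1980); an assumption in this sketch.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def principal_eigen(matrix, n_iter=200):
    """Power iteration: (largest eigenvalue, eigenvector normalized to sum one)."""
    n = len(matrix)
    w = [1.0 / n] * n
    lam = float(n)
    for _ in range(n_iter):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        lam = sum(v)  # w sums to one, so sum(A w) estimates the eigenvalue
        w = [x / lam for x in v]
    return lam, w

def consistency(matrix):
    """Relative utilities w, consistency index CI and consistency ratio CR."""
    n = len(matrix)
    lam, w = principal_eigen(matrix)
    ci = (lam - n) / (n - 1)
    return w, ci, ci / RANDOM_INDEX[n]

# Hypothetical judgments for three elements; a_ji = 1 / a_ij by construction.
judgments = [[1.0, 2.0, 4.0],
             [0.5, 1.0, 1.0],
             [0.25, 1.0, 1.0]]
w, ci, cr = consistency(judgments)
```

A perfectly consistent matrix, in which every entry equals the ratio of two underlying weights, reproduces those weights exactly and yields a CI of zero; the hypothetical judgments above are mildly inconsistent and therefore give a small positive CI.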

The AHP hierarchy can be represented by an additive model according to multi- 
attribute value theory. In doing so, the part-worth utilities are determined by multi- 
plying the relative utility values of each sub-problem along the path, from the main 
goal to the respective attribute level. The attribute importances are calculated by mul- 
tiplying the relative utility values of the attribute categories with the relative utility 
values of the attributes with respect to the related category. Then, the overall utility 
of a product or concept stimulus is derived by summing up the part-worth utilities of 
all attribute levels characterizing this alternative. 
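
The additive aggregation can be illustrated with a reduced, hypothetical hierarchy. In the Python sketch below all weights are invented for illustration and the function names are assumptions; the point is only to show the multiplication along each path and the summation over the levels of a stimulus.

```python
def part_worths(category_w, attribute_w, level_w):
    """Multiply relative utilities along main goal -> category -> attribute -> level."""
    return {
        (attr, level): category_w[cat] * w_attr * w_level
        for cat, attrs in attribute_w.items()
        for attr, w_attr in attrs.items()
        for level, w_level in level_w[attr].items()
    }

def attribute_importances(category_w, attribute_w):
    """Category weight times the attribute weight within its category."""
    return {
        attr: category_w[cat] * w_attr
        for cat, attrs in attribute_w.items()
        for attr, w_attr in attrs.items()
    }

def overall_utility(stimulus, pw):
    """Sum of the part-worth utilities of the levels describing one stimulus."""
    return sum(pw[(attr, level)] for attr, level in stimulus.items())

# Hypothetical relative utility values (each sub-problem sums to one)
category_w = {"Vacation spot": 0.5, "Hotel services": 0.5}
attribute_w = {"Vacation spot": {"Climate": 0.5, "Beach": 0.5},
               "Hotel services": {"Catering": 1.0}}
level_w = {"Climate": {"Subtropical": 0.25, "Mediterranean": 0.75},
           "Beach": {"Sea sand": 0.5, "Shingle": 0.5},
           "Catering": {"Full board": 0.5, "All-inclusive": 0.5}}

pw = part_worths(category_w, attribute_w, level_w)
importances = attribute_importances(category_w, attribute_w)
utility = overall_utility(
    {"Climate": "Mediterranean", "Beach": "Sea sand", "Catering": "All-inclusive"}, pw
)
```

Because every sub-problem is normalized to sum to one, the attribute importances also sum to one across the hierarchy.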



3 Design of the empirical study 

The attributes and levels considered in the following empirical study were determined
by means of the dual questioning technique. Repertory grid and laddering techniques
were applied to construct an average hierarchical representation of the product
evaluation problem (Scholz and Decker (2007)). Altogether, 200 respondents participated
in these pre-studies. The resulting product description design (see Table 1)
was used for both the AHP and the ACA survey. The latter was conducted according 
to the recommendations in Sawtooth Software’s recent ACA manual. 

Each respondent had to pass either the ACA or the AHP questionnaire to avoid 
learning effects and to keep the time needed to complete the questionnaire within 
acceptable limits. Neither ACA nor AHP provides a general measure of predictive
validity, which is usually quantified by presenting holdout tasks. If the number of 
attributes to be considered in a product evaluation problem is high, the use of hold- 
out stimuli is regularly accompanied by the risk of information overload (Herrmann 
et al. (2005)). The relevant set of attributes was determined for each respondent in- 
dividually to create a realistic choice setting. Each respondent was shown reduced 
product stimuli consisting of his or her six most important attributes. Accordingly, 
the predictive validity was measured by means of a computer administered holdout 
task similar to the one proposed by Herrmann et al. (2005). 
Choice tasks including three holdout stimuli were presented to each respondent
after having completed the preference measurement task. One of these alternatives 
was the best option available for the respective respondent (based on an online estimation
of individual part-worth utilities carried out during the interview). The two
other stimuli were slight modifications of this best alternative. Each one was gen- 
erated by randomly changing three attribute levels from the most preferred to the 
second or third most preferred level. 
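
The construction of such a holdout set can be sketched as follows. This Python example is an illustration under assumptions, not the survey software used in the study: the function name, the data layout (one dictionary of level utilities per attribute) and the uniform random choices are hypothetical, and the sketch does not prevent the two modified stimuli from coinciding.

```python
import random

def make_holdout_set(part_worths, n_changes=3, n_alternatives=2, seed=0):
    """Best stimulus plus modified copies with n_changes downgraded attributes."""
    rng = random.Random(seed)
    # Levels of each attribute ordered from most to least preferred
    ranked = {attr: sorted(levels, key=levels.get, reverse=True)
              for attr, levels in part_worths.items()}
    best = {attr: order[0] for attr, order in ranked.items()}
    changeable = [attr for attr, order in ranked.items() if len(order) > 1]
    alternatives = []
    for _ in range(n_alternatives):
        stimulus = dict(best)
        for attr in rng.sample(changeable, n_changes):
            # Replace the best level by the second or third most preferred one
            stimulus[attr] = rng.choice(ranked[attr][1:3])
        alternatives.append(stimulus)
    return best, alternatives

# Hypothetical individual part-worths for a respondent's six most important attributes
respondent = {f"attribute_{i}": {"x": 0.5, "y": 0.2, "z": -0.7} for i in range(6)}
best, alternatives = make_holdout_set(respondent)
```

Each modified stimulus differs from the best option in exactly three attributes, so the choice task stays close to the respondent's preferred product.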

In the last part of the online questionnaire, each respondent was faced with his 
or her individual profile of attribute importance estimates. In this regard, the corresponding
question “Does the generated profile reflect your notion of attribute importance?”
had to be answered on a 9-point rating scale ranging from “poor” (= 1) to
“excellent” (= 9).

The respondents were invited to participate in the survey via a large public e-mail 
directory. For practical reasons we sent 50 % more invitations to the ACA than to the 
AHP survey. We obtained 380 fully completed questionnaires for ACA and 204 for 
AHP. In both cases, more than 40 % of those who entered the online interview also 
completed it. Chi-square homogeneity tests show that both samples are structurally 
identical with respect to socio-demographic variables. 



4 Results 

The data quality of our samples was assessed by measuring the consistency of the 
preference evaluation tasks. To evaluate the degree of consistency for the entire hierarchy,
ACR was used for AHP. In the case of ACA, the coefficient of determination $R^2$,
measuring the goodness-of-fit of the preference model, was considered. According
to both measures, namely ACR = .17 and $R^2$ = .77, the internal validity of our study
can be rated high. To come up with a fair comparison, we accepted all completed
questionnaires and did not eliminate respondents from the samples on the basis of
ACA’s $R^2$ or AHP’s ACR.

As a first step in our empirical investigation, we compared the resulting pref- 
erence structures on the aggregate level. We transformed the part-worth utilities of 
both methods such that they sum up to zero for all levels of each attribute to facilitate 
direct comparisons. The attribute importances were transformed in both cases such 
that they sum up to one for each respondent. Spearman’s rank correlation was used 
to contrast the convergent validity of AHP with ACA. Table 2 provides the attribute 
importances and the transformed part-worth utilities of both approaches. The differences
regarding the part-worth utilities are rather small. Although both methods are
conceptually different, the obvious structural equality points to high convergent va- 
lidity. The rank correlation between AHP and ACA part-worth utilities equals .90. 
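
The transformations and the rank correlation described above can be reproduced with a few helper functions. The Python sketch below is illustrative: the function names are assumptions, tied values receive average ranks, and the numbers in the usage lines are hypothetical rather than taken from Table 2.

```python
from math import sqrt

def center(values):
    """Shift part-worths of one attribute so that they sum to zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def to_shares(values):
    """Rescale attribute importances of one respondent so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = center(average_ranks(x)), center(average_ranks(y))
    cov = sum(a * b for a, b in zip(rx, ry))
    return cov / sqrt(sum(a * a for a in rx) * sum(b * b for b in ry))

# Hypothetical part-worths of one attribute from two methods
method_a = center([0.60, 0.45, 0.03])
method_b = center([0.50, 0.31, 0.06])
rho = spearman(method_a, method_b)
```

Since the two hypothetical vectors order the levels identically, their rank correlation is one even though the utility values themselves differ.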

In contrast, there are substantial differences between the attribute importances of 
AHP and ACA on the aggregate level (r = -.08). To assess the factual quality of attribute
importances, we verified the present results by considering previous empirical
studies in the field of tourism. In a recent study by Hamilton and Lau (2004) the access
to the sea or lake was ranked second among the 10 attributes considered in this
study. The importance of the corresponding attribute location in our study is higher 
for AHP than for ACA which favors the values provided by the former. Analogously, 
the attribute active sports (which corresponds to leisure activities in our study) was 
rated as very important by only 6 % of the respondents in a survey by the Study Group
“Vacation and Travelling” (FUR (2004)). On the other hand, the importance of the
attribute relaxation, which is similar to outside facilities in our study, was rated
highly. In this respect, the AHP results are in line with the FUR study, awarding high
importance to outside facilities and lower importance to leisure activities.

It is difficult to find an appropriate external criterion for measuring the validity
of the resulting individual attribute importances. We chose the respondents’
individual perceptions as an indicator and measured the adequacy of the importance



Table 2. Average attribute importances and part-worth utilities

Category / Attribute  Method  Importance  Part-worths* by attribute level
                                          One       Two       Three     Four      Five      Six
Vacation spot
  Sightseeing offers  ACA      9.51       .24 (1)   .09 (2)  -.33 (3)
                      AHP      6.19       .21 (1)   .02 (2)  -.23 (3)
  Security concerns   ACA     10.87       .36 (1)   .06 (2)  -.42 (3)
                      AHP     11.86       .53 (1)  -.09 (2)  -.44 (3)
  Climate             ACA     11.45       .01 (2)   .36 (1)  -.37 (3)
                      AHP      9.69      -.13 (2)   .39 (1)  -.26 (3)
  Beach               ACA      9.83      -.10 (2)   .35 (1)  -.25 (3)
                      AHP      5.56      -.09 (2)   .26 (1)  -.17 (3)
Hotel services
  Leisure activities  ACA     11.72      -.20 (6)  -.02 (4)   .04 (2)   .20 (1)   .01 (3)  -.03 (5)
                      AHP      7.52      -.04 (6)   .02 (2)  -.01 (4)   .01 (3)   .03 (1)  -.01 (5)
  Furnishing          ACA     10.15       .10 (1)  -.13 (4)  -.03 (3)   .06 (2)
                      AHP     12.49       .08 (1)  -.09 (4)  -.01 (3)   .03 (2)
  Catering            ACA     12.17      -.19 (5)   .03 (3)   .12 (1)  -.07 (4)   .10 (2)
                      AHP     13.29      -.07 (5)  -.01 (3)   .02 (2)  -.04 (4)   .10 (1)
Hotel facilities
  Location            ACA      7.78      -.24 (2)   .24 (1)
                      AHP     12.84      -.32 (2)   .32 (1)
  Type of building    ACA      9.09       .08 (2)  -.22 (3)   .14 (1)
                      AHP      8.36      -.03 (2)  -.12 (3)   .15 (1)
  Outside facilities  ACA      7.40       .25 (1)   .00 (2)  -.25 (3)
                      AHP     12.11       .28 (1)  -.09 (2)  -.19 (3)

(* The ranking of attribute levels is depicted in brackets.)
estimates in the last part of the questioning by means of a 9-point rating scale
question (see Section 3). Here, AHP was judged significantly better (p < .01) with an
average value of 7.3 compared to ACA with 6.68. This suggests that AHP yields
higher congruence with the individual perceptions than ACA. But since it is not 
clear to what extent respondents are really aware of their attribute importances, the 
explanatory power of this indicator has not been fully established. 

The predictive accuracy of both methods was checked by comparing the overall 
utilities of the holdout stimuli with the actual choice in the presented holdout task 
as explained in Section 3. Both methods were evaluated by two measures: The first 
choice hit rate equals the frequency with which a method correctly predicts the
vacation package chosen by the respondents. Here, AHP significantly outperforms ACA
with 83.33 % against 60.78 % (p < .01). The overall hit rate indicates how often a
method correctly predicts the rank order of the three holdout stimuli as stated by the
respondents. Taking into account that the respondents had to rank alternatives of their
evoked sets (i.e. the best and two “near-best” alternatives), the predictive accuracy of
both approaches is clearly satisfactory. Again, AHP significantly outperforms ACA
with an overall hit rate equal to 63.42 % compared to 43.94 % for the latter (p < .01).
For comparison: random prediction would lead to an overall hit rate equal to .16, since
only one of the 3! = 6 possible rank orders of three stimuli is correct. All
in all, AHP shows a significantly higher predictive accuracy for products belonging 
to the evoked set of the respondents than ACA. 
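
The two accuracy measures can be illustrated with a short Python sketch. It is not the evaluation code used in the study: the function names and the data for four respondents are hypothetical, and stimuli are assumed to be identified by their index.

```python
from itertools import permutations


def predicted_order(utilities):
    """Indices of the stimuli sorted from highest to lowest predicted utility."""
    return sorted(range(len(utilities)), key=lambda i: -utilities[i])


def hit_rates(predicted_utilities, stated_orders):
    """Return (first-choice hit rate, overall hit rate) over all respondents.

    predicted_utilities[r][s]: overall utility of holdout stimulus s for respondent r.
    stated_orders[r]: stimulus indices as ranked by respondent r, best first.
    """
    first_hits = 0
    order_hits = 0
    for utilities, stated in zip(predicted_utilities, stated_orders):
        predicted = predicted_order(utilities)
        first_hits += predicted[0] == stated[0]
        order_hits += predicted == list(stated)
    n = len(stated_orders)
    return first_hits / n, order_hits / n


# Hypothetical data for four respondents and three holdout stimuli.
utilities = [[0.9, 0.4, 0.1], [0.2, 0.8, 0.5], [0.3, 0.6, 0.7], [0.5, 0.1, 0.4]]
stated = [[0, 1, 2], [1, 0, 2], [2, 1, 0], [2, 0, 1]]
first_choice, overall = hit_rates(utilities, stated)

# Chance level of the overall hit rate: one of the 3! = 6 possible rank orders.
chance_overall = 1 / len(list(permutations(range(3))))
```

In this toy example three of the four first choices and two of the four complete rank orders are predicted correctly; the chance level of 1/6 corresponds to the value of about .16 quoted in the text.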



5 Conclusions and outlook 

This paper presents an online implementation of AHP for consumer preference mea- 
surement in the case of products with larger numbers of attributes. As a first bench- 
mark, we empirically compared AHP with Sawtooth Software’s ACA in the domain 
of summer vacation packages. While both methods yielded high values for internal 
and convergent validity, AHP significantly outperforms ACA regarding individually 
tailored holdout tasks generated from the respondents’ evoked sets. The results sug- 
gest AHP as a promising method for preference-driven new product development. 

Further empirical investigations are required to support the results presented here. 
These should include additional preference measurement approaches, such as SE or 
Bridging CA (Green and Srinivasan (1990)). Moreover, the implications of different
hierarchies have not been fully understood in AHP research (Pöyhönen et al.
(2001)). While we conducted extensive pre-studies to arrive at an expedient
hierarchy, market researchers should be very careful when structuring their decision
problems hierarchically. The application of simple 3-level hierarchies focusing
on the main goal, attributes and levels only, and leaving out higher-level attribute 
categories might be beneficial. These hierarchies would also be reasonable when the 
product evaluation problem cannot be broken down into ‘natural’ groups of attribute 
categories. 
Abstract. The rank based multivariate exponentially weighted moving average (rMEWMA)
control chart was proposed by Messaoud et al. (2005). It is a generalization, using the
data depth notion, of the nonparametric EWMA control chart for individual observations
proposed by Hackl and Ledolter (1992). The authors approximated its asymptotic
in-control performance using an integral equation and assuming that a sufficiently
large reference sample is available. The present paper studies the effect of using
reference samples with a limited number of observations on the in-control and
out-of-control performance of the proposed control chart. Furthermore, general
recommendations for the required reference sample sizes are given so that the
in-control and out-of-control performance of the rMEWMA control chart approaches its
asymptotic counterpart.



1 Introduction 

In practice, rMEWMA control charts are used with reference samples containing a
limited number of observations. In this case, the estimation effect may affect their
in-control and out-of-control performance. This issue is discussed in this paper based
on the results of Messaoud (2006). In section 2, we review the data depth notion. The
rMEWMA control chart is introduced in section 3. The effect of using reference samples
with a limited number of observations on its in-control and out-of-control performance
is studied in section 4.



2 Data depth 

Data depth measures how deep (or central) a given point $X \in \mathbb{R}^d$ is with
respect to (w.r.t.) a probability distribution $F$ or w.r.t. a given data cloud
$S = \{Y_1, \ldots, Y_m\}$. There are several measures for the depth of the
observations, such as Mahalanobis depth, simplicial depth, half-space depth, and
majority depth of Singh, see Liu et al. (1999). In this work, only the Mahalanobis
depth is considered, see section 4.1.
The Mahalanobis depth

The Mahalanobis depth of a given point $X \in \mathbb{R}^d$ w.r.t. $F$ is defined by

$$MD(F, X) = \frac{1}{1 + (X - \mu_F)' \Sigma_F^{-1} (X - \mu_F)},$$

where $\mu_F$ and $\Sigma_F$ are the mean vector and covariance matrix of $F$,
respectively. The sample version of $MD$ is obtained by replacing $\mu_F$ and
$\Sigma_F$ with their sample estimates.
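
As an illustration, the sample Mahalanobis depth, i.e. one over one plus the squared Mahalanobis distance to the sample mean, can be computed for bivariate data as in the following sketch. The function names are hypothetical, the code is restricted to two dimensions so that the inverse covariance matrix has a closed form, and the unbiased covariance estimate (divisor n - 1) is an assumption, since the text does not specify the estimator.

```python
def mean_vector(sample):
    """Component-wise mean of a list of 2-d points."""
    n = len(sample)
    return [sum(p[0] for p in sample) / n, sum(p[1] for p in sample) / n]


def covariance_matrix(sample):
    """2x2 sample covariance matrix with divisor n - 1 (an assumption)."""
    n = len(sample)
    mx, my = mean_vector(sample)
    sxx = sum((p[0] - mx) ** 2 for p in sample) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in sample) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in sample) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def mahalanobis_depth(point, sample):
    """Sample Mahalanobis depth: 1 / (1 + squared Mahalanobis distance)."""
    mx, my = mean_vector(sample)
    (sxx, sxy), (_, syy) = covariance_matrix(sample)
    det = sxx * syy - sxy * sxy  # must be non-zero for the inverse to exist
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form (x - mean)' S^{-1} (x - mean) via the closed-form 2x2 inverse.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return 1.0 / (1.0 + d2)


# Hypothetical data cloud: the four corners of a square.
cloud = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
depth_center = mahalanobis_depth((1.0, 1.0), cloud)
depth_corner = mahalanobis_depth((2.0, 2.0), cloud)
```

The depth attains its maximum of one at the sample mean and decreases towards zero as a point moves away from the centre of the cloud.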



3 The proposed rMEWMA control chart 



Let $X_t = (x_{1,t}, \ldots, x_{d,t})'$ denote the $d \times 1$ vector of quality
characteristic measurements taken from a process at time point $t$, where $x_{j,t}$,
$j = 1, \ldots, d$, is the observation on variate $j$ at time $t$. Assume that the
successive $X_t$ are independent and identically distributed random vectors. Assume
that $m > 1$ independent random observations $\{X_1, \ldots, X_m\}$ from an in-control
process are available. That is, the rMEWMA monitoring procedure starts at time $t = m$.

Let $RS = \{X_{t-m+1}, \ldots, X_t\}$ denote a reference sample comprised of the $m$
most recent observations taken from the process at time $t \ge m$. It is used to decide
whether or not the process is still in control at time $t$. The main idea of the
proposed rMEWMA control chart is to represent each multivariate observation of the
reference sample by its corresponding data depth. Thus, the depths $D(RS, X_i)$,
$i = t - m + 1, \ldots, t$, are calculated w.r.t. $RS$.

Now, the same principles proposed by Hackl and Ledolter (1992) are used to construct
the rMEWMA control chart. Let $Q_t^*$ denote the sequential rank of $D(RS, X_t)$
among $D(RS, X_{t-m+1}), \ldots, D(RS, X_t)$. It is given by

$$Q_t^* = 1 + \sum_{i=t-m+1}^{t} I\big(D(RS, X_t) > D(RS, X_i)\big), \qquad (1)$$

where $I(\cdot)$ is the indicator function. It is assumed that tied data depth measures
are not observed. Thus, $Q_t^*$ is uniformly distributed on the $m$ points
$\{1, 2, \ldots, m\}$. The standardized sequential rank $Q_t^m$ is given by



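$$Q_t^m = \frac{2}{m}\left(Q_t^* - \frac{m+1}{2}\right). \qquad (2)$$

It is uniformly distributed on the $m$ points $\{1/m - 1, 3/m - 1, \ldots, 1 - 1/m\}$
with mean $\mu_{Q_t^m} = 0$ and variance $\sigma^2_{Q_t^m} = (m^2 - 1)/(3m^2)$, see
Hackl and Ledolter (1992).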
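
A minimal Python sketch of the sequential rank and its standardization is given below. It assumes, following Hackl and Ledolter (1992), that the rank k in {1, ..., m} is mapped to (2/m)(k - (m + 1)/2), which yields exactly the support {1/m - 1, ..., 1 - 1/m} stated above; the function names and the depth values are hypothetical, and the depths of the reference sample are taken as given, with the newest observation last.

```python
def sequential_rank(depths):
    """Rank of the newest depth (last entry) among the m depths of the reference sample.

    Counts how many depths are strictly smaller than the newest one and adds 1,
    so the shallowest point gets rank 1 and the deepest gets rank m.
    """
    newest = depths[-1]
    return 1 + sum(1 for d in depths if newest > d)


def standardized_rank(depths):
    """Map the sequential rank from {1, ..., m} to {1/m - 1, ..., 1 - 1/m}."""
    m = len(depths)
    return (2.0 / m) * (sequential_rank(depths) - (m + 1) / 2.0)


# Hypothetical depths of a reference sample of size m = 4 (newest observation last).
depths = [0.2, 0.5, 0.9, 0.4]
rank = sequential_rank(depths)
q = standardized_rank(depths)

# Support of the standardized rank for m = 4 and its moments.
support = [(2.0 / 4) * (k - 2.5) for k in range(1, 5)]
support_mean = sum(support) / 4
support_variance = sum(s ** 2 for s in support) / 4
```

For m = 4 the support is {-0.75, -0.25, 0.25, 0.75}, with mean zero and variance 15/48, in agreement with (m^2 - 1)/(3m^2).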

The control statistic $T_t$ is the EWMA of standardized sequential ranks. It is
computed as follows

$$T_t = \min\{B, (1 - \lambda) T_{t-1} + \lambda Q_t^m\}, \qquad (3)$$

$t = 1, 2, \ldots$, where $0 < \lambda < 1$ is a smoothing parameter, $B > 0$ is a
reflecting boundary and $T_0 = u$ is a starting value. The process is considered
in-control as long as $T_t > h$, where $h < 0$ is a lower control limit ($h < u < B$).
Note that the lower-sided rMEWMA is considered because the statistic $Q_t^m$ is higher
“the better”. Indeed, a high value of $Q_t^m$ means that observation $X_t$ is deep
w.r.t. $RS$, which indicates a process improvement. A reflecting boundary is included
to prevent the rMEWMA control statistic from drifting to one side indefinitely. It is
known that EWMA schemes can suffer from an “inertia problem” when there is a process
change some time after the beginning of monitoring. That is, an EWMA control statistic
can have wandered away from the center line in a direction opposite to that of a shift
that occurs some time after the start of monitoring. In this unhappy circumstance, an
EWMA scheme can take a long time to signal. For further details about the design of
rMEWMA control charts, see Messaoud (2006).
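
The recursion with a reflecting boundary can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name is hypothetical, the starting value is set to 0, a signal is raised as soon as the statistic is no longer above h, and the input ranks are made-up values.

```python
def rmewma_path(standardized_ranks, lam, h, B, start=0.0):
    """Apply T_t = min(B, (1 - lam) * T_{t-1} + lam * Q_t) and report the first signal.

    Returns (statistics, signal_time); signal_time is the 1-based index of the
    first statistic that is not above the lower control limit h, or None.
    """
    t_prev = start
    statistics = []
    for t, q in enumerate(standardized_ranks, start=1):
        t_now = min(B, (1.0 - lam) * t_prev + lam * q)
        statistics.append(t_now)
        if t_now <= h:
            return statistics, t
        t_prev = t_now
    return statistics, None


# Two deep observations followed by a run of outlying ones (hypothetical ranks).
path, signal = rmewma_path([0.9, 0.9, -0.9, -0.9, -0.9, -0.9], lam=0.3, h=-0.551, B=0.551)

# A long run of deep observations is capped by the reflecting boundary B.
capped, no_signal = rmewma_path([0.95] * 10, lam=0.3, h=-0.551, B=0.551)
```

In the second call the statistic would converge to 0.95 without the boundary; capping it at B limits how far the chart can drift away from the control limit before a deterioration occurs, which is the inertia effect described above.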

In practice, when measurements or other numerical observations are taken, it often
happens that two or more observations are tied. The most common approach to this
problem is to assign to each observation in a tied set the midrank, that is, the average 
of the ranks reserved for the observations in the tied set. 
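
The midrank rule can be written down directly. The sketch below uses a hypothetical function name and returns the midrank of the newest value (last entry) among all values; it reduces to the ordinary sequential rank when no ties are present.

```python
def midrank_of_newest(values):
    """Midrank of the last entry of `values` among all entries.

    The tied set occupies the ranks smaller + 1, ..., smaller + tied, whose
    average is smaller + (tied + 1) / 2.
    """
    newest = values[-1]
    smaller = sum(1 for v in values if v < newest)
    tied = sum(1 for v in values if v == newest)  # includes the newest entry itself
    return smaller + (tied + 1) / 2.0


# Three tied values share the ranks 2, 3 and 4, so each receives the midrank 3.
tied_rank = midrank_of_newest([0.3, 0.5, 0.5, 0.5])

# Without ties the midrank coincides with the ordinary rank.
plain_rank = midrank_of_newest([0.2, 0.5, 0.9, 0.4])
```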

The statistical design of the rMEWMA control chart refers to the choice of
combinations of $\lambda$, $h$, $B$ and $m$. It ensures that the chart performance
meets certain statistical criteria. These criteria are often based on aspects of the
run length distribution of the control chart. The run length (RL) of a control chart is
a random variable that represents the number of plotted statistics until a signal
occurs. The most common measure of control chart performance is the expected value of
the run length, i.e. the average run length (ARL). The ARL should be large when the
process is statistically in-control (in-control ARL) and small when a shift has
occurred (out-of-control ARL). However, conclusions based on in-control and
out-of-control ARL alone can be misleading. Knowledge of the in-control and
out-of-control RL distributions provides a more comprehensive understanding of the
in-control and out-of-control performance of a control chart. For example, the lower
percentiles of the in-control and out-of-control RL distributions give information
about the early false alarm rate and the ability of a control chart to quickly detect
an out-of-control condition.
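
Run length distributions are typically explored by simulation. The following Python sketch estimates the ARL, the standard deviation of the run length and three percentiles for a lower-sided EWMA with a reflecting boundary. It rests on simplifying assumptions that are not part of the paper: the standardized ranks are replaced by independent draws from the continuous uniform distribution on (-1, 1), the starting value is 0, and the seed and the number of replications are arbitrary.

```python
import math
import random


def run_length(lam, h, B, rng, start=0.0, max_steps=1_000_000):
    """Number of plotted statistics until the EWMA first reaches or drops below h."""
    t_stat = start
    for step in range(1, max_steps + 1):
        q = rng.uniform(-1.0, 1.0)  # stand-in for a standardized sequential rank
        t_stat = min(B, (1.0 - lam) * t_stat + lam * q)
        if t_stat <= h:
            return step
    return max_steps


def summarize(run_lengths):
    """ARL, SDRL and the 10th, 50th and 90th percentiles of simulated run lengths."""
    ordered = sorted(run_lengths)
    n = len(ordered)
    arl = sum(ordered) / n
    sdrl = (sum((r - arl) ** 2 for r in ordered) / (n - 1)) ** 0.5
    # Smallest observed run length with at least a fraction p of the runs at or below it.
    quantiles = {p: ordered[max(0, math.ceil(p * n) - 1)] for p in (0.10, 0.50, 0.90)}
    return arl, sdrl, quantiles


rng = random.Random(2007)
runs = [run_length(0.3, -0.551, 0.551, rng) for _ in range(2000)]
arl, sdrl, quantiles = summarize(runs)
```

Because run length distributions are strongly right-skewed, the ARL usually lies well above the median, which is why the percentiles carry information that the ARL alone does not.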

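The integral equation (4) is used to approximate the asymptotic in-control ARL
of rMEWMA control charts

$$L(u) = 1 + L(B)\,\Pr\!\left(q \ge \frac{B - (1 - \lambda)u}{\lambda}\right) + \frac{1}{\lambda}\int_{h}^{B} L(y)\, f\!\left(\frac{y - (1 - \lambda)u}{\lambda}\right) dy, \qquad (4)$$

where $L(u)$ denotes the ARL of the chart started at $T_0 = u$ and $f(q)$ is the
probability density of the uniform distribution. In this approximation, it is assumed
that a sufficiently large reference sample is available and the slight dependence
among successive ranks $Q_t^m$ is ignored.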
4 Effect of the reference sample size on rMEWMA control chart performance

4.1 Simulation Study 

Messaoud (2006) conducted a simulation study in order to examine the estimation
effect on the desired in-control and out-of-control run length (RL) performance of
rMEWMA control charts. Desired in-control and out-of-control RL performance means
that the empirical in-control and out-of-control RL distributions approach their
asymptotic counterparts. As mentioned, only the Mahalanobis rMEWMA control
charts are considered.

For the simulation, random independent observations $\{X_t\}$ are generated from a
bivariate normal distribution with mean vector $\mu_0 = (0, 0)'$ and variance
covariance matrix $\Sigma_X$. Note that due to the nonparametric nature of rMEWMA
control charts, the normality of the observations is not required and any other
distribution could be used. A shift in the mean vector from $\mu_0$ to $\mu_1$ is
considered to represent the out-of-control process. Its magnitude $\delta$ is given by

$$\delta = \sqrt{(\mu_1 - \mu_0)' \Sigma_X^{-1} (\mu_1 - \mu_0)}. \qquad (5)$$

Other out-of-control scenarios, for example a change in the in-control covariance
matrix $\Sigma_X$, are not considered. Note that in the context of multivariate
normality, $\delta$ is called the noncentrality parameter.
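
The shift magnitude can be computed as in the following sketch, which assumes that the magnitude is the Mahalanobis distance between the in-control and the shifted mean vector, the usual form of the noncentrality parameter. The function name is hypothetical and the code is restricted to the bivariate case.

```python
def shift_magnitude(mu0, mu1, cov):
    """Mahalanobis distance between two bivariate mean vectors for covariance `cov`."""
    (a, b), (c, d) = cov
    det = a * d - b * c  # assumed non-zero (cov must be invertible)
    dx, dy = mu1[0] - mu0[0], mu1[1] - mu0[1]
    # (mu1 - mu0)' cov^{-1} (mu1 - mu0) with the closed-form inverse of a 2x2 matrix
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return quad ** 0.5


identity = [[1.0, 0.0], [0.0, 1.0]]

# Shift of size 1.5 along the first coordinate axis.
delta_axis = shift_magnitude((0.0, 0.0), (1.5, 0.0), identity)

# A shift in another direction with the same Mahalanobis distance.
delta_oblique = shift_magnitude((0.0, 0.0), (0.9, 1.2), identity)
```

With the identity covariance matrix the magnitude reduces to the Euclidean length of the shift, so both shifts above have the same magnitude of 1.5 although they point in different directions.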

Since the multivariate normal distribution is elliptically symmetrical and the
Mahalanobis depth is affine invariant, see Liu et al. (1999), the Mahalanobis rMEWMA
control charts are directionally invariant. That is, their out-of-control ARL
performance depends on a shift in the process mean vector $\mu$ only through the value
of $\delta$. Thus, without any loss of generality, the shift is fixed in the direction
of $e_1 = (1, 0)'$ and the variance covariance matrix $\Sigma_X$ is taken to be the
identity matrix $I$. For more details about the simulation study, see Messaoud (2006).

4.2 Simulation results 

Messaoud (2006) considered the four Mahalanobis rMEWMA control charts with
$\lambda = 0.05$, 0.1, 0.2 and 0.3. In this paper, only the Mahalanobis rMEWMA control
chart with $\lambda = 0.3$, $h = -0.551$ and $B = -h$ is studied in detail.

Table 1 shows summary statistics of the in-control ($\delta = 0$) and out-of-control
($\delta \ne 0$) run length (RL) distributions of the Mahalanobis rMEWMA control charts
based on reference samples of size $m = 10$, 28, 100, 200, 500, 1000 and 10000
($m \approx \infty$). Note that the desired in-control ($\delta = 0$) ARL performance
is obtained using $m = 28$, which motivates this choice. SDRL is the standard deviation
of the run length. $Q(.10)$, $Q(.50)$ and $Q(.90)$ are respectively the 10th, 50th and
90th percentiles of the in-control and out-of-control RL distributions. In the
following, $\mathrm{ARL}_0$ and $\mathrm{ARL}_1$ are used to represent the in-control
($\delta = 0$) and out-of-control (for any $\delta \ne 0$) ARL, respectively.
Similarly, $Q_0(q)$ and $Q_1(q)$ refer to the $100q$th percentile of the in-control
($\delta = 0$) and out-of-control (for any $\delta \ne 0$) RL distributions,
respectively. Note that $Q_0(.50)$ and $Q_1(.50)$ are respectively the in-control and
out-of-control median RL.
Table 1. In-control ($\delta = 0$) and out-of-control ($\delta \ne 0$) run length
properties of Mahalanobis rMEWMA control charts with $\lambda = 0.3$ and $h = -0.551$
based on reference samples of size $m$

                            Shift magnitude $\delta$
   m         Statistic      0.0      0.5      1.0      1.5      2.0      2.5      3.0

   10        ARL         342.18   341.42   339.42   334.52   326.63   316.80   306.92
             SDRL        338.74   338.62   338.77   338.89   338.54   337.80   337.35
             Q(.10)          38       37       35       30       22       12        5
             Q(.50)         238      237      236      230      222      212      201
             Q(.90)         786      785      784      779      771      759      749

   28        ARL         199.77   196.56   183.44   151.96   105.25    59.10    28.04
             SDRL        193.98   193.91   193.13   187.73   169.44   133.97    93.43
             Q(.10)          25       21        9        5        3        3        3
             Q(.50)         140      137      124       86       12        5        4
             Q(.90)         456      452      438      399      325      205       44

   100       ARL         185.15   170.90   118.21    43.15     9.17     4.32     3.47
             SDRL        176.11   175.05   162.56   104.09    31.47     4.67     0.91
             Q(.10)          24       15        6        4        3        3        3
             Q(.50)         133      118       40       10        5        4        3
             Q(.90)         414      398      329      124       12        6        5

   200       ARL         188.05   160.85    76.68    15.85     5.88     4.02     3.36
             SDRL        177.44   173.60   131.29    40.39     4.87     1.49     0.75
             Q(.10)          23       14        6        4        3        3        3
             Q(.50)         138       99       25        8        5        3        3
             Q(.90)         420      389      234       26       10        6        4

   500       ARL         196.22   138.36    38.11    10.40     5.47     3.92     3.32
             SDRL        185.11   163.28    63.83     8.37     2.85     1.34     0.69
             Q(.10)          24       14        6        4        3        3        3
             Q(.50)         141       79       21        8        5        3        3
             Q(.90)         445      350       78       20        9        6        4

   1000      ARL         199.35   119.63    29.83     9.91     5.38     3.88     3.31
             SDRL        192.86   140.93    32.33     7.28     2.73     1.31     0.67
             Q(.10)          24       13        6        4        3        3        3
             Q(.50)         141       73       20        8        5        3        3
             Q(.90)         455      280       65       19        9        6        4

   $\infty$  ARL         201.00    99.02    26.16     9.58     5.29     3.85     3.29
             SDRL        197.71    98.23    23.12     6.66     2.61     1.26     0.65
             Q(.10)          24       13        6        4        3        3        3
             Q(.50)         141       68       19        8        5        3        3
             Q(.90)         459      223       56       18        9        5        4

NOTE: ARL = average run length
SDRL = standard deviation of run length distribution
Q(q) = $100q$th percentile of run length distribution



Performance of rMEWMA control charts based on small reference samples 

Table 1 shows that the $\mathrm{ARL}_0$ performance of the rMEWMA control chart is
approximately equal to the desired $\mathrm{ARL}_0$ of 200 using $m = 28$. Moreover,
$Q_0(.10)$, $Q_0(.50)$ and $Q_0(.90)$ are approximately equal to their asymptotic
counterparts. However, Table 1 shows that the $\mathrm{ARL}_1$, $Q_1(.50)$ and
$Q_1(.90)$ values of this control chart are much larger than the corresponding values
of rMEWMA control charts with larger values of $m$. Therefore, even though relatively
small reference samples achieve the desired in-control RL performance, this choice
considerably reduces the ability of the rMEWMA control chart to quickly detect an
out-of-control condition.

Performance of rMEWMA control charts based on moderate and large 
reference samples 

In the following, the rMEWMA control charts based on moderate and large reference 
samples are considered, i.e., m = 100, 200, 500, and 1000. 

In-control case ($\delta = 0$)

Table 1 shows that the $\mathrm{ARL}_0$ values of the rMEWMA control charts based on
reference samples of size $m = 100$, 200, 500 and 1000 are shorter than the desired
$\mathrm{ARL}_0$ of 200. That is, these control charts produce more false alarms than
expected. However, interpretation based on the $\mathrm{ARL}_0$ values alone can be
misleading. The $Q_0(.90)$ values given in Table 1 indicate that the larger percentiles
of the in-control RL distributions affect the $\mathrm{ARL}_0$ values.

For example, consider the rMEWMA control chart with $m = 200$. Its $\mathrm{ARL}_0$
value is 6.44% shorter than its asymptotic value, see Table 1. Table 1 shows that
the $Q_0(.10)$ value is approximately equal to its asymptotic value of 24. The
$Q_0(.50)$ value is equal to 138. It is slightly shorter than its asymptotic value of
141. That is, the control chart produces a false alarm within 138 observations with a
probability of 0.5, compared with 141 observations with the same probability when
$m \approx \infty$. Thus, the control chart does not suffer from the problem of early
false alarms. However, the $Q_0(.90)$ value is equal to 420. It is much shorter than
its asymptotic value of 459. This implies that the larger percentiles affect the
$\mathrm{ARL}_0$ value.

Now we will focus on the probabilities of the occurrence of early false alarms. 
As mentioned, these probabilities are reflected in the lower percentiles of the in- 
control RL distributions. The 5th, 10th, 20th, 30th, 40th and 50th percentiles of the 
in-control RL distributions of the rMEWMA control charts with reference samples 
of size m = 100, 200, 500 and 1000 are nearly the same as their asymptotic values, 
see Messaoud (2006). Only the $Q_0(.40)$ and $Q_0(.50)$ values of the rMEWMA control
charts with $100 \le m \le 200$ are slightly shorter than their asymptotic values.

Therefore, we can conclude that the observed decreases in the $\mathrm{ARL}_0$ values
in Table 1 are caused by the shorter values of the larger percentiles. Practitioners
need not worry about early false alarms when reference samples of size $m \ge 100$
observations are used.

Out-of-control case ($\delta \ne 0$)

Table 1 shows that the $\mathrm{ARL}_1$ values of the rMEWMA control charts are larger
than their asymptotic counterparts. However, interpretation based on the
$\mathrm{ARL}_1$ values alone may lead to inaccurate conclusions. Thus, the lower
percentiles and the median of the out-of-control RL distributions are investigated.
They provide useful information about the ability of rMEWMA control charts to quickly
detect an out-of-control condition.

First, we investigate the out-of-control RL performance of the rMEWMA control
charts for shifts of magnitude $\delta \ge 1.5$. Table 1 shows that the $Q_1(.10)$ and
$Q_1(.50)$ values are nearly the same as their asymptotic values. However, the
$Q_1(.90)$ values are larger than their asymptotic values. That is, the
$\mathrm{ARL}_1$ values are affected by some long runs. For example, consider the
rMEWMA control chart with a reference sample of size $m = 100$. Its $\mathrm{ARL}_1$
value for detecting a shift of magnitude $\delta = 1.5$ is 350.42% larger than its
asymptotic value of 9.58. Table 1 shows that the $Q_1(.10)$ and $Q_1(.50)$ values are
nearly the same as their asymptotic counterparts, whereas the $Q_1(.90)$ value is equal
to 124, much larger than its asymptotic value of 18. Therefore, we can conclude that
the estimation effect does not affect the ability of the rMEWMA control chart with
$\lambda = 0.3$ to quickly detect shifts of magnitude $\delta \ge 1.5$ when reference
samples of size $m \ge 100$ are used.
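
The percentage deviations quoted in this section follow directly from the ARL values in Table 1. The small helper below, whose name is hypothetical, reproduces the two figures mentioned in the text.

```python
def relative_deviation(value, reference):
    """Percentage by which `value` lies above (+) or below (-) `reference`."""
    return 100.0 * (value - reference) / reference


# ARL values taken from Table 1 (smoothing parameter 0.3).
out_of_control = relative_deviation(43.15, 9.58)   # m = 100 vs. asymptotic, shift 1.5
in_control = relative_deviation(188.05, 201.00)    # m = 200 vs. asymptotic, no shift
```

Rounded to two decimals, the first value is the 350.42% increase of the out-of-control ARL for m = 100, and the second is the 6.44% decrease of the in-control ARL for m = 200.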

Now we investigate the out-of-control RL performance of the rMEWMA control
charts for shifts of magnitude $\delta = 0.5$ and 1.0. The lower percentiles of the
out-of-control RL distributions of rMEWMA control charts with $100 \le m \le 200$ are
larger than their asymptotic values, see Messaoud (2006). That is, the estimation
effect reduces the sensitivity of these control charts to shifts of magnitude
$\delta \le 1.0$. For rMEWMA control charts with $500 \le m \le 1000$, the lower
percentiles of the out-of-control RL distribution are nearly the same as or slightly
larger than the asymptotic values. Therefore, we can conclude that using reference
samples of size $m \ge 500$ ensures that the rMEWMA control chart with $\lambda = 0.3$
performs like one with sufficiently large reference samples, i.e., $m \approx \infty$.
Its ability to quickly detect an out-of-control condition is not affected.

Sample size requirements 

Note that similar results are observed for rMEWMA control charts with
$\lambda = 0.05$, 0.1 and 0.2, see Messaoud (2006). Therefore, we can conclude that
using large reference samples of size $m \ge 500$ will reduce the estimation effect on
the in-control and out-of-control RL performance of rMEWMA control charts. The early
false alarms produced by the rMEWMA control charts and the early detection of
out-of-control conditions are mainly used to evaluate their in-control and
out-of-control performance. The reader should be aware that the sample size
recommendation may differ for other out-of-control scenarios, for example a shift in
the in-control covariance matrix.



5 Conclusion 

In this work, the estimation effect on the performance of the rMEWMA control
chart is studied. General recommendations for the required reference sample sizes
are given so that the in-control and out-of-control RL performance of the rMEWMA
control chart approaches its asymptotic counterpart. As noted, only a shift in the mean
vector is considered to represent the out-of-control process. The required large
reference samples of $m \ge 500$ observations should not be a problem for applications
of rMEWMA monitoring procedures. Nowadays, advances in data collection as well as the
computational power of digital computers have increased the size of the data sets
available in many industrial processes. However, practitioners should not neglect the
estimation effect on the in-control and out-of-control performance of the rMEWMA
control charts if forming large reference samples is problematic in some industrial
applications.
Abstract. With increasing duration of a relationship, the probability that customers
experience particularly negative as well as very positive interaction episodes
increases. A key question that has not been investigated in the literature concerns the
impact of these extreme interaction experiences, referred to as Critical Incidents
(CIs), on the quality and strength of consumer-firm relationships. In a sample of
customers in a service setting we first demonstrate that the number of negative
(positive) CIs indeed has a negative (positive) and asymmetric impact on measures of
relationship quality (satisfaction, trust) and on a measure of relationship strength
(loyalty). Second, using a MIMIC approach, we shed further light on the question of
which particular incidents are really critical for a customer-firm relationship and
which have to be prevented with priority.



1 Introduction 

Customers’ interaction experiences with a service provider vary widely, ranging from
remarkably positive to remarkably negative experiences. Especially extreme interaction
experiences in the service process might significantly influence the customer-
firm relationship. These extreme interaction experiences are therefore referred to as 
Critical Incidents (CIs). Several methods to record and measure CIs exist. The most 
commonly used is the Critical Incident Technique (CIT), which records CIs through 
interviews. The method’s major advantage in comparison to e.g. traditional attribute 
based measures of satisfaction lies in the collection of experiences from the respon- 
dents’ perspective. Thus service quality perceptions are not measured with ratings on 
predefined attributes but captured in the customer’s own words (Edvardsson, 1992). 
An extension of the CIT is the Sequential Incident Technique (SIT). Following the
process of service delivery and consumption, CIs as well as usual events are
collected to inform about crucial points within this process. Besides its costliness,
it remains unclear whether this method is suitable for understanding customer
satisfaction, and its application is limited to standardized processes (Stauss &
Weinlich, 1997). Both methods assume that the collected CIs are indeed critical for
the customer-company relationship but do not assess their criticality. This
shortcoming is alleviated with the
Switching Path Analysis Technique (SPAT), which retrospectively asks customers 
for experiences accountable for their provider switch (Roos, 2002). Although this 
method clearly collects incidents that are truly critical, it is only applicable to re- 
spondents who have just switched - a strong limitation for recruiting respondents 
and a reason for the rare application of SPAT. 

In contrast, more than 140 studies applying the conventional CIT have appeared in the
marketing literature since its introduction to the marketing community by Bitner,
Booms, and Tetreault (1990). Even though the CIT is the most applicable method and also
the most widely applied technique, current CIT studies suffer from severe
methodological weaknesses. A recent review by Gremler (2004) on the usage of the CIT in
marketing highlights frequent shortcomings of its existing applications. Specifically,
multiple incidents occurring in the same context and multiple occurrences of the same
CI are generally not collected. Besides, many studies restricted their collection to
negative CIs and one CI per respondent. An alarming 38% of these studies do not report
any type of reliability statistic. Furthermore, the relevance of CIs for a customer
relationship is assumed but not assessed, since CIs are not linked to key relationship
constructs such as satisfaction, trust and loyalty. Following Gremler’s (2004) call,
we conducted a study without these shortcomings. In contrast to existing studies,
rather than merely assuming it, we explicitly model the impact of the number of
experienced CIs on relationship outcomes, notably trust and loyalty. Further, we apply
a Multiple Indicators and Multiple Causes (MIMIC) approach to model the impact of the
category of the experienced CIs on relationship outcomes in order to understand which
interaction episodes were particularly damaging or supportive for a marketing
relationship.



2 Hypotheses 

During their relationship with a service provider some customers might constantly 
experience encounters characterized by expected employee behaviors and reactions. 
In addition to these neutral encounters, other customers experience remarkably
delightful or upsetting interaction episodes. These unexpected CIs are a source of
dis/satisfaction (Bitner et al., 1990, p. 83), and are assumed to impact the overall
evaluation of the service. The negative effect on satisfaction of experiencing one
negative CI versus none has already been confirmed (Odekerken-Schröder et al., 2000).
However, explicit tests regarding the influence of the number of experienced CIs 
and their valence (positive / negative) on overall satisfaction are lacking. The over- 
all evaluation of the service is based on the quality of regular service performances 
as well as extreme positive and negative experiences. Therefore, we argue that the 
number of prior positive (negative) CIs has a positive (negative) influence on service 
satisfaction. We therefore propose: 

H1: The number of positive critical incidents impacts positively on satisfaction.

H2: The number of negative critical incidents impacts negatively on satisfaction.
CIs possess a high diagnosticity for drawing inferences regarding the partner’s
intentionality and disposition and therefore the status of the relationship itself
(Ybarra & Stephan, 1999; Fiske, 1980). Even though this is highly counterintuitive, the
only existing study finds no effect of CIs on trust (Odekerken-Schröder et al., 2000).
Still, CIs are exactly those “moments of truth” that relationship partners use to make
inferences about the intentionality of the other party and can therefore be either
trust building or trust destroying. We therefore propose that:

H3: The number of positive critical incidents impacts positively on trust.

H4: The number of negative critical incidents impacts negatively on trust. 

Building on Kahneman and Tversky’s prospect theory (1979), a broad body of research
has demonstrated the consistent asymmetrical impact of negative information,
attributes, and events (Baumeister et al., 2001). CIs are by definition clearly above or
below a neutral reference point and are thus perceived to be either very positive or very
negative. In accordance with findings on the asymmetric impact of negative events
in the psychological literature (e.g. Taylor, 1991), we propose:

H5: The number of negative critical incidents impacts more strongly on satisfaction
than the number of positive critical incidents.

H6: The number of negative critical incidents impacts more strongly on trust than
the number of positive critical incidents.

Numerous studies have demonstrated that both trust and satisfaction are determinants
of repurchase intentions (Fornell et al., 1996; Geyskens et al., 1999; Morgan
& Hunt, 1994; Szymanski & Henard, 2001). Because findings concerning these
relationships are almost unanimous, we do not elaborate on them further and propose:

H7: Trust in the service provider increases loyalty to the service.

H8: Satisfaction with the service increases loyalty to the service.

Furthermore, Singh and Sirdeshmukh (2000) argue that satisfactory experiences
with a product or service are likely to reinforce expectations of competent
performance in the future, whereas below-expectation performance is expected to reduce
trust. Thus we propose:

H9: Satisfaction with a focal product increases trust in the manufacturer.

In the process of forming loyalty intentions, previous studies have documented the
mediating role of satisfaction (Szymanski & Henard, 2001) and trust (e.g. Morgan
& Hunt, 1994). Building on these findings from the marketing literature, we
propose a similar mediating mechanism for the experienced number of CIs on loyalty
intentions.

H10: The impact of critical incidents on loyalty is fully mediated via trust and
satisfaction.

We need to note an important control variable neglected in the previous literature
on CIs: mood. Cognitive processes are affected by mood (e.g. Forgas, 1995). In the
context of the present study two mechanisms merit attention. First, mood congruent
recall (e.g. Eich et al., 1994) implies that mood influences the frequency of recalled
CIs; thus respondents in a positive (negative) mood will be more (less) likely to recall
positive CIs. Second, affect infusion (e.g. Forgas, 1995) implies that mood affects
judgments such as satisfaction and trust; thus a positive (negative) mood will additionally
enhance (reduce) these ratings. The interplay of these two effects might result in inflated
correlations between the number of recalled incidents and measures of relationship quality.
To exclude this possibility we control for respondents’ mood states.



3 Method 

The empirical investigation was conducted with customers of repair departments of
a major German car manufacturer. In 5 different outlets in a metropolitan area,
customers entering the store were asked to take part in a satisfaction survey. 207
customers agreed to participate and 191 of them had prior experiences with the service
department. Accordingly, 191 face-to-face interviews were conducted, consisting of a
fully-structured and a semi-structured part. In the opening fully-structured part,
participants were first asked to indicate their current mood, followed by questions
capturing their satisfaction, trust and intention to stay loyal to the service department.
Afterwards, in the semi-structured part of the interview, respondents were asked to talk
about any CI concerning the repair department. This part of the interview followed
the widely used procedure of Bitner et al. (1990). Thus, the central question was:
“Think of your experiences with the repair department. Can you remember
particularly good or bad experiences during your contacts with the repair department?”
It has to be mentioned that, in contrast to most CIT studies, neither the number of
CIs nor their valence (positive or negative) was restricted. All
customer reports were recorded.



4 Results 

Customers who answered all relevant questions (N = 146) were eligible for hypothesis
testing. These respondents were predominantly male (86%), on average 52 years
old (standard deviation = 12.2), and reported a total of 185 CIs, of which 78 were
positive. The hypothesized models were estimated with LISREL 8.52 (Jöreskog &
Sörbom, 2001). In the first step we tested our hypotheses concerning the impact of the
number of CIs on relationship outcomes (satisfaction, trust, and loyalty). The basic
model exhibits an excellent fit: $\chi^2(29) = 20.45$, p = .88, Root Mean Square
Error of Approximation (RMSEA) = 0.0, and Comparative Fit Index (CFI) = 1.00.
Overall, the model explains 31% of the variance in satisfaction, 53% in trust, and 82%
in loyalty. Except for the influence of positive CIs on trust (H3) ($\gamma_{21} = .02$, p > .05),
all hypotheses tested in this step were supported. Results show that the number of
experienced positive CIs per respondent has a positive impact on satisfaction with the
repair department (H1) ($\gamma_{31} = .26$, p < .01). The number of experienced negative CIs impacts
negatively on satisfaction with the repair department (H2) ($\gamma_{32} = -.47$, p < .01) and
trust in the service provider (H4) ($\gamma_{22} = -.36$, p < .01). The expected influences of
satisfaction on trust (H9) ($\beta_{23} = .47$, p < .01) and loyalty (H8) ($\beta_{13} = .51$, p < .01),
as well as trust on loyalty (H7) ($\beta_{12} = .49$, p < .01), were also confirmed.
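
The total causal effects on trust that are quoted later in the Discussion (0.58 for negative and 0.12 for positive incidents, in absolute terms) can be traced back to the path coefficients above. The decomposition below is a reading of the reported numbers, assuming the usual rule that a total effect is the direct path plus the indirect path via satisfaction; it is not a calculation spelled out in the text, and the positive-CI figure matches only if the non-significant direct path $\gamma_{21}$ is left out.

```latex
\begin{align*}
\text{negative CIs} \to \text{trust:}\quad
  & \gamma_{22} + \gamma_{32}\,\beta_{23} = -.36 + (-.47)(.47) \approx -.58 \\
\text{positive CIs} \to \text{trust:}\quad
  & \gamma_{31}\,\beta_{23} = (.26)(.47) \approx .12
\end{align*}
```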

Next, we tested our hypotheses concerning the asymmetric impact of the number
of positive and negative experienced CIs on satisfaction and trust. A model with
gamma coefficients constrained to be equal results in a non-significant decrease in
model fit for satisfaction ($\chi^2(1) = .39$) and a significant decrease in model fit for
trust ($\chi^2(1) = 7.4$, p < 0.01). Thus the results confirm the asymmetric impact of CIs
on trust (H6), but not on satisfaction (H5). Then, we tested whether the influence of
CIs is fully mediated via satisfaction and trust following the approach of Baron and
Kenny (1986). Both positive and negative CIs have a significant impact on loyalty,
which drops to zero when the indirect effects via satisfaction and trust are included
in the model. Thus the full mediation hypothesis is supported (H10).
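
The asymmetry tests above are chi-square difference tests with one degree of freedom. As a minimal sketch, the p-values implied by the two reported differences can be recovered with the standard library alone; the function name `chi2_sf_df1` and the 5% decision rule are illustrative assumptions, not part of the original analysis.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    # For df = 1 the chi-square variate is a squared standard normal, so
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


# Chi-square differences reported in the text (constrained vs. free model, 1 df).
delta_chi2 = {"satisfaction": 0.39, "trust": 7.4}

p_values = {name: chi2_sf_df1(value) for name, value in delta_chi2.items()}

for name, p in p_values.items():
    # Illustrative decision rule at the 5% level (an assumption of this sketch).
    verdict = "asymmetric impact" if p < 0.05 else "no evidence of asymmetry"
    print(f"{name}: delta chi2 = {delta_chi2[name]}, p = {p:.3f} -> {verdict}")
```

A significant loss of fit under the equality constraint means the positive and negative paths differ in strength, which is the pattern reported for trust but not for satisfaction.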

In order to show that the previous findings hold when controlling for mood, mood
was added to the model and allowed to impact all constructs. The resulting model fits
the data well ($\chi^2(64) = 82.87$, p = .06, RMSEA = 0.04, and CFI = .99) and
confirms the need to control findings for respondents’ current mood. Mood significantly
affects the number of negative CIs experienced (or, more precisely, recalled) with
$\gamma_{15} = .36$, p < .01, judgments of satisfaction ($\gamma_{13} = -.27$, p < .01), and trust
($\gamma_{12} = -.29$, p < .01). Only positive CIs ($\gamma_{14} = -.11$, p > .05) were not influenced
by respondents’ current mood. As expected, mood did not influence loyalty intentions
($\gamma_{11} = -.04$, p > .05). Although mood showed the proposed influences, all previously
reported findings hold while controlling for mood.

Having confirmed that the number of positive and negative CIs impacts
customer-firm relationship constructs, we address the question of which categories of
CIs are especially critical for the relationship. To this end, the 185 reported events were
coded by three independent judges. The first judge developed an initial classification
scheme and subsequently assigned the CIs to this initial scheme. The second and
third judges were advised to question this classification scheme while assigning
all 185 CIs obtained to the scheme. The resulting intercoder reliability was acceptable
as assessed by various indices: percentage of agreement: .80, Cohen’s Kappa:
.76, and Perreault and Leigh: .11. Disagreement regarding categories and the assignment
of individual CIs to categories was resolved through a discussion of the unclear
incidents with an expert from the automotive industry. The classification process
revealed seven negative CI categories (e.g. low speed of service) and seven positive CI
categories. Since three categories had been experienced by fewer than 3% of the
customers, they were excluded from the analysis; thus 11 categories remained eligible
for the analyses (6 negative, 5 positive).
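
To make the first two reliability indices concrete, the sketch below computes the percentage of agreement and Cohen's kappa for two coders. The category labels and the two coding sequences are hypothetical illustration data, not the study's codings, and the study itself involved three judges, so an index of this kind would be computed per pair of judges.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of incidents that both judges assigned to the same category."""
    return sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(codes_a)
    p_observed = percent_agreement(codes_a, codes_b)
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: product of the two judges' marginal category shares.
    p_expected = sum(
        counts_a[c] * counts_b[c] for c in set(codes_a) | set(codes_b)
    ) / (n * n)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical codings of eight incidents by two judges (illustration only).
judge_1 = ["promise", "quality", "speed", "goodwill", "speed", "extra", "quality", "promise"]
judge_2 = ["promise", "quality", "speed", "speed", "speed", "extra", "quality", "goodwill"]

agreement = percent_agreement(judge_1, judge_2)
kappa = cohens_kappa(judge_1, judge_2)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than raw agreement because part of the observed agreement would arise by chance given how often each judge uses each category.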

Instead of testing a model with the aggregate number of experienced CIs, the
different CI categories experienced by the respondents were included in a MIMIC
model. Basically, each respondent can be described by a vector of zeros and ones
indicating which particular incident types he has experienced in his relationship with the
service department of his dealership. These binary incident category variables are
then related to relationship outcomes with a MIMIC approach, an approach tailored
to deal with dichotomous independent variables (Bollen, 1989). Unlike in ordinary
structural equation models, model estimation is here based on biserial correlations
(Jöreskog & Sörbom, 2001, p. 240). The proposed MIMIC model (see figure 1)
exhibits an excellent fit: $\chi^2(138) = 117.26$, p = .90, RMSEA = 0.0, and CFI =
1.00. The model confirmed that not all CIs are indeed critical for the customer-firm
relationship, and those CI categories that are critical varied in the strength of their
impact. Breaking a promise and experiencing poor quality of repair work influence
solely satisfaction ratings ($\gamma_{36} = -.23$, p < .01 and $\gamma_{39} = -.32$, p < .01), whereas CIs
classified as showing no goodwill and restriction to basic service lowered customers’
trust in the service provider ($\gamma_{28} = -.14$, p < .01 and $\gamma_{2,10} = -.14$, p < .01). The
incident category which should be primarily avoided is negative behaviors toward
the customer, since it clearly has the most damaging impact on the customer-firm
relationship, due to its dual influence on trust ($\gamma_{2,11} = -.21$, p < .01) and satisfaction
($\gamma_{3,11} = -.26$, p < .01). Interestingly, only one of the positive CI categories (offering
additional service) impacts on satisfaction with the repair department ($\gamma_{33} = .23$,
p < .01) and none impacts on trust.
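
The "vector of zeros and ones" that enters the MIMIC model can be sketched as follows. The category names are taken from the text, but the list covers only the six categories named there, and the respondents and their reports are hypothetical; estimating the model itself would still require SEM software such as LISREL.

```python
# Categories named in the text; the full study used 11 categories.
CATEGORIES = [
    "offering additional service",
    "breaking a promise",
    "poor quality of repair work",
    "showing no goodwill",
    "restriction to basic service",
    "negative behavior toward the customer",
]


def incident_vector(reported, categories=CATEGORIES):
    """Return a 0/1 vector: 1 if the respondent reported at least one CI
    in the category, regardless of how many incidents fall into it."""
    reported = set(reported)
    return [1 if category in reported else 0 for category in categories]


# Hypothetical coded reports for three respondents (illustration only).
respondents = {
    "r1": ["breaking a promise", "breaking a promise", "offering additional service"],
    "r2": [],
    "r3": ["negative behavior toward the customer"],
}

design = {rid: incident_vector(cis) for rid, cis in respondents.items()}
for rid, row in design.items():
    print(rid, row)
```

Note that the coding records occurrence only, so a respondent who reports the same category twice receives the same indicator value as one who reports it once.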




Fig. 1. MIMIC model: CI categories and their impact on relationship measures; significant
path coefficients are depicted.



5 Discussion 

Even though several papers in the marketing literature have raised the question
of whether and which incidents are really critical for a customer-firm relationship
(Edvardsson & Strandvik, 2000), ours is the first study to explicitly address this
question. In the present study, we conducted CI interviews without restricting the valence
and number of incidents reported, and assessed their impact on measures of
relationship quality. Our results confirm that positive and negative incidents possess a
partially asymmetric impact on satisfaction and trust. Negative incidents have
particularly damaging effects on a relationship through their strong impact on trust (total
causal effect: 0.58). These results are in stark contrast to Odekerken-Schröder et al.’s
(2000) conclusion that CIs do not play a significant role in developing trust.
Further, the damage inflicted by negative incidents can hardly be “healed” with very
positive experiences, since the total causal effect of the number of positive incidents
on trust is substantially smaller (0.12). Thus, management should clearly put
emphasis on avoiding negative interaction experiences. The employed MIMIC approach
followed Gremler’s call (2004, p. 79) to “determine which events are truly critical to
the long-term health of the customer-firm relationship” and revealed which specific
incident categories have a particularly strong impact on relationship health and should
be avoided with priority, such as negative behavior toward the customer. The
collected vivid verbatim stories from the customer’s perspective provide very concrete
information for managers and can be easily communicated to train customer-contact
personnel (Zeithaml & Bitner, 2003; Stauss & Hentschel, 1992). For further studies,
as pointed out by one of the reviewers, an alternative would be
to measure the experienced severity of the CI categories instead of their
mere occurrence.



Abstract. Despite the claim that satisfaction ratings are linked to actual repurchase
behavior, the number of studies that actually relate satisfaction ratings to actual repurchase
behavior is limited (Mittal and Kamakura 2001). Furthermore, in those studies that investigate the
satisfaction-retention link, customers have repeatedly been shown to defect even though they
state that they are highly satisfied. In a dramatic illustration of the problem, Reichheld (1996) reports
that while around 90% of industry customers report being satisfied or even very satisfied, only
between 30% and 40% actually repurchase. In this contribution, the relationship between
satisfaction and retention was examined using a sample of 1493 business clients in the light
transporter segment of a major European market. To examine heterogeneity in the
satisfaction-retention relationship, a finite-mixture approach was chosen to model a mixed logistic
regression. The subgroups found by the algorithm differ with respect to the relationship between
satisfaction and loyalty, as well as with respect to the exogenous variables. The resulting model allows
us to shed more light on the role of the numerous moderating and interacting variables in the
satisfaction-loyalty link in a business-to-business context.



1 Introduction 

It has been one of the fundamental assumptions of relationship marketing theory that
customer satisfaction has a positive impact on retention¹. Satisfaction was supposed
to be the only necessary and sufficient condition for attitudinal loyalty (stated
repurchase behavior) and the more manifest retention (actual repurchase behavior) and has
been used as an indicator for future profits (Reichheld 1996, Bolton 1998). However,
this seemingly undisputed relationship could not be fully confirmed by empirical
studies (Gremler and Brown 1996). Further research points out that there can be a
large gap between one-time satisfaction and repurchase behavior. An intention to
repurchase (i.e. the statement in a questionnaire) does not always lead to an actual
repurchase, and continuous repurchasing might exist without satisfaction because of mere
price settings (see Söderlund and Vilgon 1999, Morwitz 1997). What is more, only
a small number of studies have actually examined repurchase behavior instead of the
more easily obtained repurchase intentions (Bolton 1998, Mittal and Kamakura 2001, Rust
and Zahorik 1993). The tenor of these studies is that the link between satisfaction
and retention is clearly weaker than the link between satisfaction and loyalty.

¹ See Anderson et al. (2004), Bolton (1998), Söderlund and Vilgon (1999).

Many other factors were discovered to have an influence on retention. More
technical issues such as common method variance, mere measurement effects or simply
unclear definitions have also raised doubts about the importance and the exact magnitude of
the contribution of satisfaction (Reichheld 1996, Söderlund and Vilgon 1999, Giese and Cote
2000). Another reason for the weak relationship between satisfaction and retention
is that it may not be a simple linear one, but one moderated by several different
variables. Several studies have already examined the effect of moderating variables on
the satisfaction-loyalty link (e.g. Homburg and Giering 2001). However, the great
majority of empirical studies in this field measured repurchase intentions instead
of objective repurchase behavior (Seiders et al. 2005). Thus, the conclusion from
prior work is that considerable heterogeneity is present that might explain the often
surprisingly weak overall relationship.

An important contribution has been put forth by Mittal and Kamakura (2001).
They combined the concepts of response biases and different thresholds² in their
model to capture individual differences between respondents. Based on their results
they created a customer group where repurchase behavior was completely unrelated
to levels of stated satisfaction. However, their approach fails to identify actually existing
groups that have a distinctive relationship between satisfaction and retention. For
example, if model results show that older people have a lower threshold and thus
repurchase with a higher probability given a certain level of satisfaction, this is not the full
story. Other factors, measured or unmeasured, might offset the age effect. In order
to find groups with distinctive relationships between satisfaction and retention, we
have explicitly chosen a finite-mixture³ approach, which results in a mixed logistic
(ML) regression setup. This model type basically consists of G logistic regressions, one
for each latent group. This way, each case i is assigned to a group with a unique
relation between the two constructs of interest. However, in a Bernoulli case like
this (see McLachlan and Peel 2000, p. 163ff), identifiability is not given. The
necessary and sufficient condition for identifiability is $G^{\max} = \frac{1}{2}(m + 1)$, where m is the
number of Bernoulli trials. For m = 1 no ML-regression can be estimated. But
Follmann and Lambert (1991) prove theoretical identifiability of a special case of binary
ML-regressions: only the thresholds $\lambda$ are allowed to vary over the groups, while
all remaining regression parameters are equal for all groups. According to Theorem
2 of Follmann and Lambert (1991), theoretical identifiability then depends only on
the maximal number $N^{\max}$ of different values of one covariate given that the values of all
other covariates are held constant. The maximal number of components is then given
by $G^{\max} = \sqrt{N^{\max} + 2} - 1$. Thus, the theorem restricts the choice of the variables
but ultimately helps in building a suitable model for the relationship under investigation.

² In our model thresholds are tolerance levels and can be conceived as the probability of
repurchase given all other covariates are zero.

³ For an overview of finite-mixture models, see McLachlan and Peel (2000) and the
references therein.

In our final model we also included so-called concomitant variables,
which help to understand latent class membership and enhance the interpretability of
each group or class. This is achieved by using a multinomial regression of the latent
class variable c on these variables x:

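$$P(c_{gi} = 1 \mid x_i) = \frac{e^{\alpha_g + \Gamma_g x_i}}{\sum_{l=1}^{G} e^{\alpha_l + \Gamma_l x_i}} = \frac{e^{\alpha_g + \Gamma_g x_i}}{1 + \sum_{l=1}^{G-1} e^{\alpha_l + \Gamma_l x_i}}. \tag{1}$$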



Here $\alpha$ is a $(G-1)$-dimensional vector of logit constants and $\Gamma$ a $(G-1) \times Q$
matrix of logit coefficients. The last group G serves as a standardizing reference group
with $\alpha_G = 0$ and $\Gamma_G = 0$. This results in a model of a mixed logistic regression with
concomitant variables:



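$$P(y_i = 1 \mid x_i) = \sum_{g=1}^{G} P(c_{gi} = 1 \mid x_i)\, P(y_i = 1 \mid c_{gi} = 1, x_i), \tag{2}$$

with

$$P(y_i = 1 \mid c_{gi} = 1, x_i) = \frac{e^{\lambda_g + \beta_g x_i}}{1 + e^{\lambda_g + \beta_g x_i}}.$$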
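
Equations (1) and (2) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the Mplus estimation used in the paper: the function names are introduced here, the same covariate vector `x` is used in both parts of the model, and all parameter values in the example call are hypothetical.

```python
import math


def class_probabilities(x, alpha, gamma):
    """Equation (1): multinomial logit for latent class membership.

    alpha holds G-1 constants and gamma G-1 rows of Q coefficients;
    the last class G is the reference class with logit 0.
    """
    logits = [a + sum(g_q * x_q for g_q, x_q in zip(g, x)) for a, g in zip(alpha, gamma)]
    logits.append(0.0)  # reference class G
    shift = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - shift) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def logistic(z):
    """Numerically stable logistic function e^z / (1 + e^z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mixture_probability(x, alpha, gamma, lam, beta):
    """Equation (2): class-weighted sum of class-specific logistic regressions."""
    pis = class_probabilities(x, alpha, gamma)
    return sum(
        pi * logistic(l + sum(b_q * x_q for b_q, x_q in zip(b, x)))
        for pi, l, b in zip(pis, lam, beta)
    )


# Hypothetical parameters for G = 3 classes and Q = 2 covariates.
p_retain = mixture_probability(
    x=[1.0, 0.0],
    alpha=[0.5, -0.2],
    gamma=[[1.0, 0.3], [0.2, -0.4]],
    lam=[-1.0, 0.0, 1.0],
    beta=[[0.8, 0.1], [0.0, 0.0], [0.0, 0.0]],
)
print(f"P(y = 1 | x) = {p_retain:.3f}")
```

Setting a class's coefficients to zero, as for the second and third class in the example, makes its outcome probability independent of the covariates.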



2 The Model 

To analyze the relationship between satisfaction and retention with an ML-regression,
we use data from a major European light truck market in a B2B environment.
The data cover all major brands, which makes it possible to identify brand
switchers and loyal customers. All respondents bought at least one light truck between
two and four months before filling in the questionnaire. Out of all respondents who
replied to all relevant questions, only those were retained who bought the new truck
as a replacement for their old one, resulting in 1493 observations. The
satisfaction-retention link is operationalized in Mplus 4.0 using the response-bias
effect introduced by Mittal and Kamakura (2001), which enables us to use Theorem
2 of Follmann and Lambert (1991). Following Paulssen and Birk (2006), only
demographic response-bias effects and demographic response-bias effects moderated by
brand are estimated in our model. The resulting equation for the latent satisfaction in
logit is then:

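$$sat_i^* = \beta_1\, sat_i + \beta_2\, sat_i \cdot cons_i + \beta_3\, sat_i \cdot age_i + \beta_4\, sat_i \cdot brand_i + \beta_5\, sat_i \cdot cons_i \cdot brand_i + \beta_6\, sat_i \cdot age_i \cdot brand_i + e_i.$$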

The satisfaction-retention link for a latent class g can then be written as⁴:

$$P(\text{Retention}_i = 1 \mid c_{gi} = 1, sat, cons, age, brand) = P(sat_i^* > \lambda_g) = \frac{e^{-\lambda_g + sat_i^*}}{1 + e^{-\lambda_g + sat_i^*}}.$$

⁴ Here age stands for the standardized stated age, cons for consideration set and brand
indicates a specific brand.


The latent class variable c is regressed on the concomitant variables using a
multinomial regression. As concomitant variables we used: length of ownership of
the replaced van (standardized), ownership (self-employed 0, company 1), brand of
the replaced van (other brands 0, specific "brand 1" 1), consideration set of brands other
than the owned one (empty 0, at least one other brand 1) and dealer (not involved in
talks 0, involved 1). The model was estimated for several numbers of latent classes,
with the theoretical maximum number of classes being five. The fit indices for this model
series can be found in table 1. All four ML-models possess a better fit than a simple
logistic regression, but the criteria show a mixed picture. The AIC favors a model with four
classes and the BIC favors only one. To decide on the number of classes, the adjusted
BIC was used, which favors three classes⁵. This model was estimated using 500
random starting values and 500 iterations as recommended by Muthén and Muthén
(2006, p. 327). The log-likelihood of the chosen model fails to be reproduced in only
nine out of 100 sequences, which, according to Muthén and Muthén (2006, p. 325),
points clearly toward a global maximum.
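
The information criteria reported in Table 1 can be recomputed from the log-likelihoods as a plausibility check. The sketch below is an illustration under two assumptions that are not stated in the text: the numbers of free parameters (2, 7, 14, 19 and 28) are inferred from the relation AIC = -2 logL + 2k, and the adjusted BIC is taken to be the sample-size adjusted BIC that replaces n by (n + 2)/24.

```python
import math

N_OBS = 1493  # sample size reported in the text

# Log-likelihoods from Table 1. The parameter counts are NOT reported in the
# source; they are inferred from AIC = -2*logL + 2*k and are an assumption.
MODELS = {
    "Simple LR": (-971.280, 2),
    "G = 1": (-928.104, 7),
    "G = 2": (-902.727, 14),
    "G = 3": (-888.346, 19),
    "G = 4": (-873.472, 28),
}


def aic(loglik, k):
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    return -2.0 * loglik + k * math.log(n)


def adjusted_bic(loglik, k, n):
    # Sample-size adjusted BIC: the penalty uses (n + 2) / 24 instead of n.
    return -2.0 * loglik + k * math.log((n + 2.0) / 24.0)


criteria = {
    name: {
        "AIC": aic(ll, k),
        "BIC": bic(ll, k, N_OBS),
        "aBIC": adjusted_bic(ll, k, N_OBS),
    }
    for name, (ll, k) in MODELS.items()
}


def best(criterion):
    """Model with the smallest value of the given criterion."""
    return min(criteria, key=lambda name: criteria[name][criterion])


for criterion in ("AIC", "BIC", "aBIC"):
    print(criterion, "->", best(criterion))
```

Because the BIC penalizes each additional parameter more heavily than the AIC, the two criteria can point to very different numbers of classes, with the adjusted BIC lying in between.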



Table 1. Model Fit

Criterion         Simple LR      G = 1      G = 2      G = 3      G = 4
Log-Likelihood     -971.280   -928.104   -902.727   -888.346   -873.472
AIC                1946.559   1870.209   1833.454   1814.692   1802.945
BIC                1957.176   1907.369   1907.774   1915.554   1951.584
Adjusted BIC       1950.823   1885.132   1863.300   1855.196   1862.636
Entropy                   -          -      0.531      0.563      0.885



Entropy for the chosen model is 0.563, which indicates modest separation of 
the classes. As can be clearly seen in table 2, the discriminatory power is mixed with 
class 2 being well separated (0.821), while classes 1 and 3 are not perfectly separable. 



Table 2. Misclassification matrix

Class        1       2       3
1        0.762   0.063   0.175
2        0.179   0.821   0.000
3        0.328   0.000   0.672



⁵ See Nylund et al. (2006).

The results of this model are shown in table 3. The thresholds of latent classes
2 and 3 were fixed after the first models we estimated showed extreme values for them,
resulting in a probability of repurchase of 0% and 100%, respectively. This means that
for both groups the repurchase probability is independent of the values of the covariates.
In this way the algorithm effectively works as a filter and puts those respondents
who repurchase or do not repurchase independently of their satisfaction into separate
groups. Thus, the only unfixed threshold is 3.174 for latent class 1. This class has
a weight of 49.4%, while class 2 has 27.7% and class 3 represents 29.9% of the
respondents. The estimated value for $\beta_1$ is 0.944 and is, like all other coefficients,
significant at the 5% level. The value for $\beta_1$ represents the main effect of satisfaction
with the previous van in case all other covariates are zero. In this case the odds
of repurchasing the same brand increase by the factor $e^{0.944} = 2.57$, which means
satisfaction has a positive effect on the odds of staying with the same brand versus
buying another brand. The estimates $\beta_2$ and $\beta_3$ correspond to response-bias effects
in case the brand is not the specific brand 1. Both estimates are significant, meaning
that response bias is present. The interpretation of the beta coefficients is similar to
before in that all other covariates are assumed to be zero. When considering only
respondents who previously had a van of brand 1, that is brand = 1, things change.
The effect for age, given that the consideration set is empty, becomes 0.147 - 0.131 =
0.016, almost completely wiping out the influence of response bias. For the covariate
consideration set, results are analogous: given a sample-average age, the
response-bias effect for respondents who replaced a van of brand 1 collapses to -0.244 +
0.254 = 0.01. As to the multinomial logistic regression of the latent class variable
c on the concomitant variables, class 3 has been chosen to be the reference class.
The constants $\alpha_g$ can be used to compute the probabilities of class membership for
respondents who have an average length of ownership, are self-employed,
had not replaced a van of brand 1, did not consider another brand and were
not involved in talks with the dealer. For this group the probability of membership in class g is
$e^{\alpha_g} / (1 + \sum_{l=1}^{G-1} e^{\alpha_l})$⁶. The probability of class membership in class 1 increases with
increasing length of ownership. For short lengths of ownership of about one year, the
probability of membership is highest for class 3. However, the probability of membership
in class 2 is hardly influenced by the length of ownership. Self-employed respondents have a
probability of belonging to class 1 of more than 80% despite the non-significance of
the owner variable. The influences on class membership of the other concomitant
variables can be explained analogously.
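
The class membership probabilities for the baseline group described above can be recomputed from the estimates in Table 3. The sketch below is an illustration, not output of the original Mplus run: the dictionary layout and the function name are assumptions, and because the published coefficients are rounded, the results agree with the footnoted percentages (roughly 68%, 18% and 14%) only up to small rounding differences.

```python
import math

# Estimates from Table 3; class 3 is the reference class with all logits 0.
COEFFS = {
    1: {"const": 1.573, "length": 1.131, "owner": -1.995,
        "brand1": 1.455, "consideration": 0.141, "dealer": 0.912},
    2: {"const": 0.243, "length": 1.049, "owner": -0.440,
        "brand1": 0.275, "consideration": 1.199, "dealer": -0.213},
}


def membership_probabilities(profile):
    """Multinomial-logit class probabilities for one covariate profile."""
    logits = {
        g: c["const"] + sum(c[name] * value for name, value in profile.items())
        for g, c in COEFFS.items()
    }
    logits[3] = 0.0  # reference class
    denominator = sum(math.exp(v) for v in logits.values())
    return {g: math.exp(v) / denominator for g, v in logits.items()}


# Baseline group: average (standardized) length of ownership, self-employed,
# no brand 1, empty consideration set, no dealer talks.
baseline = {"length": 0.0, "owner": 0.0, "brand1": 0.0, "consideration": 0.0, "dealer": 0.0}
probs = membership_probabilities(baseline)

# Factor by which brand 1 changes the odds of class 1 versus class 3.
odds_ratio_brand1 = math.exp(COEFFS[1]["brand1"])

print({g: round(p, 4) for g, p in sorted(probs.items())})
print(round(odds_ratio_brand1, 2))
```

Other profiles can be evaluated by changing the values in `baseline`, keeping in mind that length of ownership is standardized.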

This model with three latent classes fits the data better than a simple logistic
regression of retention on satisfaction. The latter results in a poor Nagelkerke-R²
of only 0.063. Now, if we look again at table 2, we might make a hard allocation
of respondents to class 1, despite the fact that separation of the classes is not perfect.



⁶ The probability of belonging to class 1 is 67.94%, to class 2 17.94% and to class 3
14.12%. If the values of all concomitant variables are 1, the corresponding probabilities
become 65.83%, 27% and 7.17%. If the values of all other concomitant variables are 0, a
change from 0 to 1 in the brand variable means that the odds of belonging to class 1 compared
to class 3 change by the factor $e^{1.455} = 4.28$.

Table 3. ML-regression results

Variable                                 Value   Std.error   Z-Statistic

Response Bias for all classes
Satisfaction                             0.944       0.164        5.749*
Age*Satisfaction                         0.147       0.046        3.230*
Consideration*Satisfaction              -0.244       0.113       -2.157*
Brand 1*Satisfaction                    -0.367       0.100       -3.673*
Age*Brand 1*Satisfaction                -0.131       0.056       -2.349*
Consideration*Brand 1*Satisfaction       0.254       0.123        2.075*

Thresholds
λ1 Threshold                             3.174       0.963        3.297*
λ2 Threshold                            15.000           -             -
λ3 Threshold                           -15.000           -             -

Class 1: Concomitant Variables
α1 Constant                              1.573       1.272        1.237
Length                                   1.131       0.432        2.620*
Owner                                   -1.995       1.130       -1.765
Brand 1                                  1.455       0.635        2.292*
Consideration                            0.141       0.445        0.316
Dealer                                   0.912       0.746        1.223

Class 2: Concomitant Variables
α2 Constant                              0.243       1.085        0.224
Length                                   1.049       0.443        2.369*
Owner                                   -0.440       1.009       -0.436
Brand 1                                  0.275       0.500        0.549
Consideration                            1.199       0.299        4.004*
Dealer                                  -0.213       0.370       -0.577

* significant at the 5% level



For class 1 we then arrive at a very good Nagelkerke-R² value of 0.509. This means
that the estimated model basically works as a filter, leaving one group of respondents
with a very strong relation between satisfaction and retention and two smaller groups
with no relation at all. At this point the classes of the final model can be interpreted.
While average satisfaction ratings are essentially the same (6.77, 6.82 and 6.60 for
classes 1 to 3), the relation between satisfaction and retention is very different. As
indicated above, class 1 describes a filtered link between satisfaction with the
replaced van and retention. This class contains predominantly respondents who are
self-employed, who were involved in talks with the dealer, who had owned their
previous van for a long time and who drove a van of brand 1. In this class
increasing satisfaction corresponds to a higher retention rate. This means in turn that
marketing measures to increase retention via satisfaction campaigns are feasible for
this group. Respondents of class 2 considered brands other than the brand of their
replaced van prior to their purchase decision, which increased the number of choices
they had for making the purchase decision. However, this class can also be
considered as being influenced by factors other than those observed in our study. These
factors might further explain why the retention rate is zero, although some members
were in fact satisfied with their replaced van. It is easy to imagine that a large number
of reasons, including pure coincidence, can lead to such behavior. The third class,
where respondents repurchase independently of their satisfaction, has at least one
distinctive feature. This class is dominated by very short lengths of ownership, which
might be explained by the presence of leasing contracts.



3 Discussion 

Previous studies have examined customer characteristics as moderating effects of the 
satisfaction-retention link. In order to further investigate this, we built on a model 
developed by Mittal and Kamakura (2001) that we expanded by including manu- 
facturer and company characteristics as additional moderating variables. Previous 
research did not fully investigate the moderating role of manufacturer/brand and 
company characteristics on the satisfaction retention link. Furthermore, by apply- 
ing a concomitant logit mixture approach we applied a new research method to this 
problem. Our results imply that similar to findings of Mittal and Kamakura (2001) 
customer groups exist where repurchase behavior is completely invariant to rated 
satisfaction. In the largest customer group a strong relationship between satisfaction 
and repurchase was present. Respondents in this group were self-employed, partici- 
pated in dealer talks and kept their commercial vehicles longer than members of the 
other classes. It is notable that for respondents who stated they were self-employed 
and participated in dealer talks the satisfaction-retention relationship is strong, indi- 
cating that those respondents had substantial leverage on decision making. That is, 
these respondents immediately punished bad performance of the incumbent brand 
and switched to other brands. For respondents who worked for companies, factors
other than their stated satisfaction (purchasing policies of the company, satisfaction
of other members of the buying center) may play a role. It also seems to be necessary
that the respondent had a significant involvement in the buying process, as indicated
by his participation in dealer talks. This result also points to a limitation of the
often-applied key informant approach: key informants have to be carefully screened.
It does not suffice to ask whether they participate in certain business decisions.
Abstract. This paper shows how an early-warning system, which has been developed and
implemented for a large internet service provider, can detect customer equity potentials
and risks, and how this information can be used to launch special customer treatment
depending on strategic customer control dimensions in order to increase customer equity. The
strategic customer control dimensions are: customer lifetime value, customer lifecycle and
customer behaviour types. The development of the customer control dimensions depends on
the availability of relevant customer data. Thus, from the huge amount of available customer
data, relevant attributes have been selected. In order to reduce complexity and use standardised
processes, the raw data are aggregated, for example into clusters. We will demonstrate by
means of a real-life example the detection of spatial customer equity risks and the launch of
customer equity increasing treatment using the early-warning system in interaction with the
mentioned strategic customer control dimensions.



1 Introduction*

In the b2c-sector a continuing increase in competition, growing customer expecta- 
tions, variety seeking and an erosion of margin can be observed. The solution of suc- 
cessful enterprises is a paradigm change from product-centred to customer-centred 
organisations combined with the long-run objective of customer equity (CE) max- 
imisation. 

In this paper we will show both theoretically and by means of a real-life example, 
the detection of CE risks at a very early stage as well as the launch of CE increasing 
treatment using the early-warning system (EWS) in interaction with the strategic cus- 
tomer control dimensions (SCCD). The SCCD are customer lifetime value (CLV), 
customer lifecycle (CLC) and customer behaviour types (CBT). 



* The content of this article originates from various projects. All projects have been exe- 
cuted under the leadership or substantial cooperation of the aCRM department, with the 
department head Klaus Thiel. 
In section 2 we will briefly describe the SCCD CLV, CLC and CBT. Furthermore,
we will show that space is another important dimension which has to be considered.
Section 3 will deal with the methodology and the features of the EWS. Finally, we
will show the operation of the EWS and its interaction with the SCCD by means of
a real-life example.



2 Strategic customer control dimensions 

In order to regain competitiveness, many enterprises in the b2c-sector conduct a
strategic change from product-centred to customer-centred organisations (Peppers 
and Rogers (2004)). Due to the huge amount of customer data available in the b2c- 
sector, one big challenge for managers is to find the balance between standardised 
processes and individual customer value management. In order to follow the strate- 
gic guideline of our enterprise (change from a product-centred to a customer-centred 
organisation) and in order to reduce complexity in multivariate data structures, we derive
relevant customer insight from the huge amount of data and aggregate the data
to the relevant SCCD CLV, CLC and CBT. The mentioned SCCD are also discussed 
in relevant literature (e.g. Peppers and Rogers (2004), Gupta and Lehmann (2005), 
Stauss (2000)). We have implemented the SCCD in the customer data-warehouse 
(DWH) in order to use it for operative campaign management. Thanks to a re- 
duced complexity in data structures we can mainly use standardised processes for 
customer interaction. All processes and analyses are coordinated with and approved by the
data-protection department and conform to national and supranational data-protection
guidelines (Steckler and Pepels (2006)).

2.1 Customer lifetime value 

On the one hand, the CLV is a target and controlling variable. The total sum of CLV
over all customers, the so-called CE, has to be maximised and is continuously
monitored over time. On the other hand, the CLV is also applied as a
treatment variable. This means that certain types of treatment in the customer value
management framework are launched depending on the CLV.

In order to enable customer value management, we have developed and imple- 
mented the following CLV formula according to Gupta and Lehmann (2005): 



$$CLV_c = -inv_c + \sum_{t=1}^{T_k} \frac{m_{c,t}\, r_{k,t}^{\,t}}{(1+i)^{t}}, \qquad c = 1, 2, \ldots, C;\; k = 1, 2, \ldots, K;\; t = 1, 2, \ldots, T_k. \quad (1)$$

The index t denotes years, c denotes single customers and k denotes cohorts. $CLV_c$
is the CLV of customer c, $inv_c$ is the investment from the enterprise in the customer
at the beginning of the relationship and $m_{c,t}$ is the annual margin of customer c. The
calculatory interest rate i is stipulated by financial controlling. In order to avoid large
variations of the retention-density we calculate it on cohort level instead of single




Early-Warning System to Support Activities in the Management ... 481 



customer level. All customers who have the same set-up date (quarter and year) as
well as the same set-up channel of their relationship with the enterprise belong to
the same cohort k and get the cohort-specific retention-density $r_{k,t}$ at time t. The
retention-density of cohort k at time t is calculated by $r_{k,t} = 1 - p_{k,t}$, where $p_{k,t}$ is
the churn-density of cohort k. Finally, $p_{k,t}$ is calculated by $p_{k,t} = h_{k,t} / n_k$, where $h_{k,t}$
denotes all customers of cohort k who have churned during period t and $n_k$ is the
total cohort size at the beginning of the observation period t. Every month the values
of the margin $m_{c,t}$ and the retention-density $r_{k,t}$ are dynamically re-calculated and
assumed to be constant until the end of the cohort-specific lifetime $T_k$. In order to
maximise the CE, the total sum of CLV over all customers has to be maximised:

$$CE = \sum_{c=1}^{C} CLV_c \to \max. \quad (2)$$
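The calculation behind equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the discounted-margin form of Gupta and Lehmann (2005) with a constant annual margin and a constant cohort retention-density, and all function names and figures are hypothetical rather than taken from the implemented system.

```python
def retention_density(churned: int, cohort_size: int) -> float:
    """Cohort retention-density r = 1 - p, with churn-density p = h / n."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return 1.0 - churned / cohort_size


def clv(investment: float, margin: float, retention: float,
        interest: float, lifetime: int) -> float:
    """Discounted, survival-weighted margins minus the initial investment.

    Assumption: margin and retention stay constant over the cohort lifetime.
    """
    discounted = sum(
        margin * retention ** t / (1.0 + interest) ** t
        for t in range(1, lifetime + 1)
    )
    return discounted - investment


def customer_equity(clvs) -> float:
    """CE is the plain sum of all customer lifetime values."""
    return sum(clvs)


# Hypothetical cohort: 1000 customers, 100 of them churned in the period.
r = retention_density(churned=100, cohort_size=1000)
values = [
    clv(investment=50.0, margin=120.0, retention=r, interest=0.08, lifetime=5),
    clv(investment=50.0, margin=20.0, retention=r, interest=0.08, lifetime=5),
]
print(round(r, 2), [round(v, 2) for v in values], round(customer_equity(values), 2))
```

Computing the retention-density once per cohort and reusing it for every customer in that cohort mirrors the motivation given in the text: single-customer estimates would vary too much.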

As mentioned, the CLV is also applied as a treatment variable. In the framework
of customer value management all customers are assigned to six different value
classes: A, B, C, D, E, F. Approximately the top 2% of customers are in class A. All
customers with a negative CLV belong to class F. The remaining customers are uniformly
distributed over the classes B to E. Depending on the customer value class,
incentives such as hardware discounts, birthday emails with a gift, invitations to special
events, service levels at customer touch-points etc. are offered to the customer.
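The class assignment described above can be illustrated as follows. This is a sketch under assumptions, not the implemented rule set: the 2% share is taken relative to all customers, "uniformly distributed" is read as four groups of near-equal size ordered by CLV, and the function name is hypothetical.

```python
def assign_value_classes(clv_by_customer, top_share=0.02):
    """Map customer id -> value class 'A'..'F' following the rules in the text."""
    classes = {}

    # Class F: every customer with a negative CLV.
    for cid, value in clv_by_customer.items():
        if value < 0:
            classes[cid] = "F"

    # Rank the remaining customers from highest to lowest CLV.
    ranked = sorted(
        (cid for cid in clv_by_customer if cid not in classes),
        key=lambda cid: clv_by_customer[cid],
        reverse=True,
    )

    # Class A: roughly the top 2% of all customers (at least one, if any remain).
    n_top = min(len(ranked), max(1, round(top_share * len(clv_by_customer))))
    for cid in ranked[:n_top]:
        classes[cid] = "A"

    # Classes B to E: the rest, split into four groups of near-equal size.
    rest = ranked[n_top:]
    for position, cid in enumerate(rest):
        classes[cid] = "BCDE"[position * 4 // len(rest)]
    return classes


# Hypothetical customer base with CLV values from -10 to 89.
example = {i: float(i - 10) for i in range(100)}
result = assign_value_classes(example)
print(result[99], result[50], result[0])
```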

2.2 Customer lifecycle 

In addition to the CLV, we have also developed and implemented a CLC in order to 
support customer value management. The whole CLC consists of the following six 
customer life phases: set-up-phase, finding-phase, growth-phase, risk-phase, churn- 
phase and after-care-phase. The individual phases are initialised by various triggers. 
Depending on the trigger-status, each customer is assigned to his or her individual 
phase. For example, the finding-phase starts after three and ends after twenty
internet sessions. Each phase contains various phase-specific customer
types of treatment. During the set-up-phase the customer receives special technical 
support, in the finding-phase we show the customer the product variety and dur- 
ing the growth-phase we focus on cross- and up-selling offers. We launch churn- 
prevention offers during the risk-phase and win-back offers in case of churn. After a 
successful win-back we try to bind the customer more closely during the after-care 
phase. 

2.3 Customer behaviour types 

In order to gain a better understanding of customer needs we have developed
and implemented CBT in an iterative procedure. In the first step we apply cluster
analysis procedures (hierarchical and partition-based procedures) in order to identify
groups of customers with homogeneous behaviour (Hand et al. (2001)). The second
step consists of the validation of the clusters with the use of discriminant analysis
procedures (Hand et al. (2001)). As the CBT are used for strategic purposes, medium-term
stability is necessary. For this purpose we conduct stability
tests with the use of previous campaign data in the third step. In the final step, we
conduct market research in order to profile the clusters. This results in the following
eight CBT: occasional users, stay-at-homes, after eights, conservatives, professionals,
fun seekers, super surfers and 24-7 uppers and downers. In customer
value management, the communication design, for instance, depends on the CBT. We
observe significantly better response rates when we use a communication design that
depends on the CBT rather than a general one. In addition, we trigger the development
of new products as well as further developments of existing products depending on
the CBT.

2.4 Interaction between the different strategic customer controlling 
dimensions 

The three SCCD are hard-coded in the DWH. That means that each customer is
assigned to a CLV class, a CLC phase and a CBT. The combination of the
six CLV classes, six CLC phases and eight CBT results in 6 x 6 x 8 = 288 permutations,
to each of which different business rules could be assigned. Due to business restrictions,
we are currently working with approximately 150 permutations to which
specific business recommendations are assigned. In the case of customer interaction
(inbound and outbound), the information stored in the DWH is linked with the
corresponding business recommendation. This recommendation is available to the staff
at the customer touch-point. The medium-term objective is the integration of
specific business rules for all 288 permutations.
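A minimal sketch of how such a lookup might be organised is given below. The class, phase and type names follow the text; the two rules in the table and the function name `recommend` are hypothetical examples, not the recommendations actually stored in the DWH.

```python
from itertools import product

CLV_CLASSES = ["A", "B", "C", "D", "E", "F"]
CLC_PHASES = ["set-up", "finding", "growth", "risk", "churn", "after-care"]
CBT_TYPES = [
    "occasional users", "stay-at-homes", "after eights", "conservatives",
    "professionals", "fun seekers", "super surfers", "24-7 uppers and downers",
]

# Every (CLV class, CLC phase, CBT) triple that a customer can fall into.
ALL_COMBINATIONS = list(product(CLV_CLASSES, CLC_PHASES, CBT_TYPES))

# Hypothetical rule table: only some combinations carry a recommendation.
BUSINESS_RULES = {
    ("A", "risk", "professionals"): "personal call with churn-prevention offer",
    ("C", "growth", "fun seekers"): "cross-selling offer",
}


def recommend(clv_class, clc_phase, cbt, rules=BUSINESS_RULES):
    """Return the stored recommendation, or None if no rule is assigned yet."""
    return rules.get((clv_class, clc_phase, cbt))


print(len(ALL_COMBINATIONS))
print(recommend("A", "risk", "professionals"))
print(recommend("F", "churn", "after eights"))
```

Returning `None` for unassigned combinations reflects the situation described in the text, where only about 150 of the possible permutations currently carry a business recommendation.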

2.5 Spatial differences 

Apart from the SCCD already described, space is another important dimension
in customer value management. Considering only national distinctions in the
customer base would oversimplify regional differences (Cliquet
(2006)). If one compares measured values such as population density per square
kilometre, with a range from 40 to 3973, or spending capacity per person, with a range
from €10,050 to €26,440 (infas GEOdaten (2006)), it becomes obvious that successful
customer equity management relies on spatial differentiation. This approach
requires a coherent system of different regional levels, which is determined by logical
relations between the different levels. The most suitable choice is the official
AGS structure (Amtlicher Gemeindeschlüssel), which is subject to a continuous
update process by the Federal Statistical Office. This structure describes levels from
the Federal State (Bundesland, AGS02) over the District (Kreis, AGS05) to the City
(Stadt, AGS08), providing an explicit identifier for each level. Private data providers
refine this structure by means of their own, more detailed levels beneath the City
level (e.g. Kreis-Gemeinde-Schlüssel (KGS)). The resulting micro-geographical levels
allow appropriate ways of analysing and steering.
2.6 Interaction between strategic customer controlling dimensions and space

In order to build up a framework which allows the maximisation of CE, the combination
of the SCCD CLV, CLC and CBT with the different regional levels
from AGS02 to AGS08 is necessary. Multiplying the 288 permutations
of the SCCD by the 439 Districts (AGS05) in Germany, for example,
would produce 126,432 permutations. This results in too much complexity, and so many
permutations cannot automatically be assigned different business rules in a real-life application.
How have we solved this problem? First, if we detect significantly different regional
developments, then we temporarily overrule the implemented business rule with a new one,
only for the customers concerned. The new business rule can consist of an adjustment
of the existing rule (e.g. the price for the offered product-bundle is now
€9.95 instead of €19.95), or of a completely new rule (e.g. instead of the
launch of a new product-bundle we offer 12 months of free-of-charge usage of the existing
product). For all customers not concerned, the existing business rule remains
valid. After the regional developments have disappeared we switch off
the overruling. Second, if we discover that certain regional developments differ
in the long run, then we analyse the developments in more depth. Depending
on the results, two decisions are feasible: a) we adjust or enlarge one or more of the
SCCD to cover the regional developments; b) the overruling remains valid until the
different regional developments have disappeared.
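The overruling logic can be sketched as a default rule with an optional, time-limited regional override. This is an illustration only: the function name, the district keys and the dates are hypothetical, and the actual system may store and resolve its rules differently.

```python
from datetime import date

N_SCCD_PERMUTATIONS = 6 * 6 * 8   # CLV classes x CLC phases x CBT
N_DISTRICTS = 439                 # AGS05 Districts named in the text
print(N_SCCD_PERMUTATIONS * N_DISTRICTS)  # size of a full per-District rule table


def effective_rule(default_rule, overrides, district, today):
    """Pick the regional override if one is active, else the default rule.

    overrides maps a district key to (rule, start, end); end may be None
    while a regional development is still being observed.
    """
    entry = overrides.get(district)
    if entry is not None:
        rule, start, end = entry
        if start <= today and (end is None or today <= end):
            return rule
    return default_rule


# Hypothetical override: a cheaper bundle in one District during a pilot.
overrides = {"11000": ("bundle at EUR 9.95", date(2006, 4, 1), date(2006, 9, 30))}
default = "bundle at EUR 19.95"
print(effective_rule(default, overrides, "11000", date(2006, 6, 15)))
print(effective_rule(default, overrides, "11000", date(2006, 10, 15)))
print(effective_rule(default, overrides, "05315", date(2006, 6, 15)))
```

Keeping overrides in a small separate table means the 288 default rules stay untouched, which is the point of the approach described above: only the customers concerned see a different rule, and switching the override off restores the default.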



3 Early-warning system 

The necessity for an EWS has arisen particularly from increasing competition. Many
new competitors in the telecommunications market first focused on regional offers
(e.g. Telecom Italia (Alice), Netcologne, Telebel), which resulted in massive regional
churn and regional loss of customer equity. In view of this, spatial analyses
and usable rankings for managing customer equity were overdue. Apart from
the wide range of accessibility, the EWS is characterised by its comprehensive spatial
structure and an easy-to-understand way of presenting information. Finally, it can
be applied as a digital, interactive map of Germany, which shows regions in different
colours, depending on their risk conditions. In particular, cartographical
imagery makes it possible to present complex figures in a compact way, offering an
easy-to-use steering wheel to top management. The next two subsections introduce
the methodology and architecture of the EWS.

3.1 Methodology 

The EWS visualizes risky areas regarding churn-density, new-customer-density as
well as market-density on each regional level, AGS02 to AGS08. The churn-density
on regional level a in region b at time t is calculated by:

$$l_{a(b),t} = \frac{u_{a(b),t}}{w_{a(b),t}}, \qquad a = 1, 2, \ldots, A;\; b = 1, 2, \ldots, B;\; t = 1, 2, \ldots, T; \quad (3)$$

where $u_{a(b),t}$ denotes all customers who have churned during period t on regional
level a in region b and $w_{a(b),t}$ denotes all existing customers at the beginning of the
observation period t on regional level a in region b. In analogy, the regional
new-customer-density is defined by the ratio of all customers who enter the enterprise
during period t on regional level a in region b to all existing customers in period t
on regional level a in region b. Finally, the regional market-density is defined by the
ratio of all existing customers of the enterprise to all customers with a contract with
an internet service provider in Germany in period t on regional level a in region b. In
the final step, on each regional level the indicator variables churn-indicator,
market-share-indicator and new-customer-indicator are calculated. The 20% of the regions with
the highest churn-density get the churn-indicator one, whereas the 80% of the regions
with the lowest churn-density get the churn-indicator zero. The 20% of the regions with
the lowest market-density get the market-share-indicator one, whereas the 80% of the
regions with the highest market-density get the market-share-indicator zero. Finally,
the 20% of the regions with the lowest new-customer-density get the new-customer-indicator
one, whereas the 80% of the regions with the highest new-customer-density
get the new-customer-indicator zero. Every week the three indicators are
dynamically re-calculated. Depending on the permutation and under consideration of
the strategic business framework we have developed and implemented the following
risk matrix with the corresponding risk level in the EWS (Figure 1). The risk levels
vary from one (green), over four (yellow) up to eight (red).
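The density and indicator calculation for one regional level might look as follows. This is a sketch with hypothetical function names and input dictionaries; it stops at the three indicators, because the mapping from indicator permutations to risk levels is defined by the risk matrix in Figure 1, which is not reproduced here.

```python
def density(numerator, denominator):
    """Simple ratio; regions with an empty denominator get density 0."""
    return numerator / denominator if denominator else 0.0


def flag_share(values, share=0.2, highest=True):
    """Return {region: 0 or 1}, flagging the `share` of regions with the
    highest (or lowest) values."""
    n_flag = round(share * len(values))
    ranked = sorted(values, key=values.get, reverse=highest)
    flagged = set(ranked[:n_flag])
    return {region: int(region in flagged) for region in values}


def indicators(churned, new, existing, market):
    """Return {region: (churn, market-share, new-customer) indicator}."""
    churn_d = {r: density(churned[r], existing[r]) for r in existing}
    new_d = {r: density(new[r], existing[r]) for r in existing}
    market_d = {r: density(existing[r], market[r]) for r in existing}
    churn_i = flag_share(churn_d, highest=True)     # highest churn is risky
    share_i = flag_share(market_d, highest=False)   # lowest market share is risky
    new_i = flag_share(new_d, highest=False)        # fewest new customers is risky
    return {r: (churn_i[r], share_i[r], new_i[r]) for r in existing}


# Hypothetical figures for five regions of one regional level.
existing = {"a": 100, "b": 100, "c": 100, "d": 100, "e": 100}
churned = {"a": 5, "b": 1, "c": 2, "d": 3, "e": 4}
new = {"a": 9, "b": 8, "c": 1, "d": 7, "e": 6}
market = {"a": 200, "b": 1000, "c": 300, "d": 400, "e": 500}
print(indicators(churned, new, existing, market))
```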
Fig. 1. Risk Matrix and Risk Level



3.2 EWS-Architecture 

The big challenge in the context of spatial marketing is the easy-to-use presentation
of multidimensional customer data in spatial units as a reliable support for management
decisions (Schüssler (2006)). In order to process the necessary customer
data, acquired information must be implemented at a very early stage in all relevant
business processes. For that reason the EWS is completely integrated in the DWH,
which allows access to all sources of relevant data. That means, in order to obtain
the described risk levels on each regional level in each region, 40
million data fields have to be provided and processed every week. In accordance with
data-protection guidelines, the whole process uses anonymous data and, through the
aggregation per regional level, is optimised with respect to the principles of
data-avoidance and data-economy (Steckler and Pepels (2006)).

As already mentioned, the intent is to observe and understand regional developments
at an early stage in order to react quickly. Once individual permissions have
been obtained, customer contact can be made by means of campaign management.
With the help of an implemented DWH process designed to add the AGS and
KGS identifiers, the introduced statistics and risk levels are computed with SAS software
(SAS Institute Inc.). The results are automatically written to a special database
(Oracle Spatial) that is able to store not only alphanumerical objects, but also spatial
data (Brinkhoff (2005)). Oracle Spatial combines the results and the spatial
data (e.g. the boundary of the City of Darmstadt and its risk rank) in one place, so
that subsequent processes can interact with the database based on industry standards
(Brinkhoff (2005)). By using a web-map-service (WMS) to present and transport
the results to the user, the complexity is significantly reduced. Subject to an
authorisation system, everyone in the company can access the EWS by browser via
the intranet. This architecture thus concentrates rules and data in one place and
guarantees a single source of truth for related questions. With no technical
obstacles and a low potential for misinterpretation, the EWS offers regional customer
insight for further analysis and regional campaigns as well as decision support for
top management.



4 Empirical example 

With the use of the EWS (cf. Section 3) we analysed the market situation as it was in
March 2006 in ten Districts (AGS05) in and around Berlin. We detected the following
situation regarding the risk ranking (cf. Figure 1): one with risk eight, four with risk
six, two with risk five and finally two with risk four. In the final step, we adjusted the
existing business rules derived from the SCCD (cf. Section 2) and launched special
types of pilot treatment (e.g. special product bundles with a reduced price, special
customer service, at-home product installation free of charge etc.) in the ten
Districts mentioned from April until September 2006. Table 1 shows the change in the
key figures during the observation period from March 2006 until October 2006 for
the ten Districts in and around Berlin.

The results show improvements regarding churn-density (reduced by 27%) and 
new-customer-density (increased by 12%) in comparison to the initial situation as 
in March 2006. As the total churn amount was still increasing, the market-density 
dropped by 2.2%. In spite of this development CE increased by a mere 0.2%. More 
detailed analysis reveals an increasing customer stock (sum of all customers of the 
enterprise) by 0.6% during the observation period. This means, in comparison with 
the initial situation in March 2006 the average CLV dropped but the CE increased. 
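
The last statement can be checked with a short calculation, assuming that the average CLV is simply CE divided by the customer stock (the symbol $N$ for the customer stock is introduced here only for illustration):

$$\frac{\overline{CLV}_{\text{Oct}}}{\overline{CLV}_{\text{Mar}}} = \frac{CE_{\text{Oct}} / CE_{\text{Mar}}}{N_{\text{Oct}} / N_{\text{Mar}}} = \frac{1.002}{1.006} \approx 0.996$$

Under this assumption the average CLV fell by roughly 0.4%, while CE itself rose by 0.2% because the customer stock grew faster than CE.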
Table 1. Change after the launch of special treatment

Variable               Change
Churn-Density          -27%
New-Customer-Density   +12%
Market-Density         -2.2%
Customer-Equity        +0.2%



Additionally, we have to draw attention to the control group. After observing the
situation described in the Districts in and around Berlin, we randomly selected three
further Districts in the surrounding area. These Districts continued to be treated
in the conventional way. In comparison to the focused Districts undergoing special
pilot treatment, we observed a market-density reduced by 5%, a CE reduced by 0.8%
and a customer stock reduced by 1.5%. The comparison indicates that the special types
of pilot treatment work successfully.



5 Conclusion 

We have shown that the SCCD allow a better understanding of customer behaviour
and customer needs. In addition, we have pointed out, by means of a real-life example,
that the EWS enables the detection of CE potentials and risks. Finally,
we have illustrated that the interaction between EWS and SCCD allows the derivation
of CE-increasing types of treatment.

On the one hand, the SCCD substantially reduce complexity in multivariate data
structures and allow the use of standardised processes in the case of customer interaction.
On the other hand, the interaction between SCCD and different regional levels
increases complexity considerably. It remains to be examined whether the operation 
of a completely automatic customer value management depending on all possible 
permutations of three or more SCCD in interaction with all regions on all regional 
levels will be possible. The implementation of such an application in real-life busi- 
ness will be very important, because complexity in data structures is continuously 
rising due to an increasing amount of customer data and customer micro-segments. 
Abstract. This paper introduces a finite-mixture version of the adjacent-category logit model
for the classification of companies with respect to their marketing practices. The classification
results are compared to conventional K-means clustering, as established for clustering marketing
practices in current publications. Both the results of this comparison and a canonical
discriminant analysis emphasize the opportunity to offer fresh insights and to enrich empirical
research in this domain.



1 Introduction 

Although emerging markets and transition economies are attracting increasing attention
in marketing, Pels and Brodie (2004) argue that conventional marketing knowledge
is not valid for these markets per se. Moreover, Burgess and Steenkamp (2006)
claim that emerging markets offer unexploited research opportunities due to their significant
departures from the assumptions of theories developed in the Western world,
but call for more rigorous research in this domain. However, the majority of studies
concerned with marketing in transition economies are either qualitative and descriptive or
restricted to simple cluster analysis. This paper takes up the challenge by:

• introducing a finite-mixture approach facilitating the fitting of a response model 
and the clustering of observations simultaneously, 

• investigating whether or not the Western-type distinction between marketing mix 
management and relationship management holds for groups of companies from 
Russia and Lithuania, and 

• exploring the consistency of the marketing activities. 

The remainder of this paper is structured as follows. The next section provides a 
description of the research approach, which is embedded in the Contemporary Mar- 
keting Practices (CMP) Project. In the third section, a finite-mixture approach is 
introduced and criteria for determining the number of clusters in the data are dis- 
cussed. The data and the results of this study are outlined in section 4, and section 5 
concludes with a discussion of these results. 
2 Knowledge on interactive marketing

2.1 The research approach 

In their review of reasons for the current evolution of marketing Vargo and Lusch 
(2004) propose that all economies will change to service economies and that this 
change will foster the switch from transactional to relational marketing. Another ar- 
gument for emphasizing relational, particularly interactive marketing elements, is 
supported by the distinction of B2B from B2C markets. Unfortunately, these claims 
are based on deductive argumentation, literature analysis, and case studies, but are 
hardly supported by empirical analyses. We therefore aim to investigate how
marketing elements are combined in transition economies.

For this purpose we distinguish the conventional transaction marketing approach 
(by means of managing the marketing-mix of product, place, price, and promotion) 
from four types of customer relationship and customer dialogue-oriented marketing 
(Coviello et al. (2002)): 

Transaction Marketing (TM): Companies attract and satisfy potential buyers by man- 
aging the marketing mix. They actively manage communication ‘to’ buyers in 
a mass-market. Moreover, the buyer-seller transactions are discrete and arm’s- 
length. 

Database Marketing (DM): Database technology enables the creation of individual 
relationships with customers. Companies aim to retain identified customers, although
marketing is still ‘to’ the customer. Relationships as such are not close
or interpersonal, but are facilitated and personalized through the use of database 
technology. 

E-Marketing (EM): Vendors use the Internet and other interactive technologies to 
sell products and services. The focus is on creating and mediating the dialogue 
between the organization and identified customers. 

Interaction Marketing (IM): Face-to-face interaction between individuals maintains
a communication process truly ‘with’ the customer. Companies invest resources
to develop a mutually beneficial and interpersonal relationship.

Network Marketing (NM): All marketing activities are embedded in the activities
of a network of companies. All partners commit resources to develop their com- 
pany’s position in the network of company level relationships. 

For the empirical investigation of the relevance of these marketing types, a survey 
approach has been developed. The next subsection describes the results concerning 
marketing practices in transition economies already gained through this survey ap- 
proach. 

2.2 Benchmark: Already known clusters in transition economies 

Two studies of marketing practices in emerging markets have been published. The
first study, by Pels and Brodie (2004), points out five distinct clusters of marketing
practices in the emerging Argentinean economy. This clustering is obtained by applying
the K-means algorithm to the respondents’ ratings describing their organization’s
marketing activities. Working with the same questionnaires and applying the
K-means algorithm, Wagner (2005) revealed three clusters of marketing practices in
the Russian transition economy. In the first study, the number of clusters was chosen
with respect to the interpretability of the results, whereas in the second study, the
number of clusters was estimated using the GAP-criterion. Both studies are restricted
to the identification of groups of organizations with similar marketing practices, but
do not address the relationships between particular elements of the marketing and
relationship mix. To tackle these issues, a mixture regression model is introduced in
the next section. In order to provide an assessment of the improvement due to the
employment of more sophisticated methods, we will outline the results of applying
K-means clustering to the data at hand as well.



3 A Finite Mixture approach for classifying marketing practices 
3.1 Response Model 

Mixture modeling enables the identification of the structure underlying the patterns
of variables and the partition of the observations n (n = 1, ..., N) into groups or segments
s (s = 1, ..., S) with a similar response structure simultaneously. Assuming
that each group is made up by a different generating process, $\pi_{ns}$ refers to the probability
for the n-th observation to originate from the generating process of group s.
Let $\sum_{s=1}^{S} \pi_{ns} = 1$ with $\pi_{ns} > 0 \;\forall n, s$; then the density of the observed response data is
given by:

$$f(y_n \mid x_n, \tilde{x}_n, \vartheta, \Theta) = \sum_{s=1}^{S} \pi_{ns}(k = s \mid \tilde{x}_n, \vartheta)\, f_s(y_n \mid s, x_n, \theta_s) \quad (1)$$

with

k ... nominal latent variable (k = 1, ..., S)
$y_n$ ... scalar response variable
$\tilde{x}_n$ ... vector of variables influencing the latent variable k (covariates)
$\vartheta$ ... vector of parameters quantifying the impact of $\tilde{x}_n$ on k
$x_n$ ... vector of variables influencing $y_n$ (predictors)
$\theta_s$ ... vector of parameters quantifying the impact of $x_n$ on $y_n$ in segment s
$\Theta$ ... matrix of parameter vectors $\theta_s$

From equation 1, it is obvious that this model differs from conventional latent class
regression models because of the covariates $\tilde{x}_n$ and the corresponding parameter vector
$\vartheta$, which enable an explanation of the segment membership. Thus, the covariates
$\tilde{x}_n$ differ from the predictors $x_n$, because the elements of $\tilde{x}_n$ are assumed to have
no impact on $y_n$ by means of causality in the response structure, but only by means of
segment membership. For the application of clustering marketing practices, this feature seems to be
relevant to capture the differences between organizations offering goods or services
and serving B2C or B2B markets.
The distribution of the conditional densities $f_s(y_n \mid s, x_n, \theta_s)$ might be chosen from
the exponential family, e.g., normal, Poisson, or binomial distribution. For the subsequent
application of analyzing an ordinal response using a logit approach, the canonical
links are binomial (cf. McCullagh and Nelder (1989)). The link function of the
adjacent-category logit model, with r = 1, ..., R response categories, is given by:

$$\log \frac{P(y_n = r \mid s, x_n)}{P(y_n = r-1 \mid s, x_n)} = \theta^{*}_{0rs} + x_n' \theta^{*}_{rs} \qquad (r = 2, \ldots, R;\; s = 1, \ldots, S) \quad (2)$$

with $\theta^{*}_{0rs} = \theta_{0rs} - \theta_{0,r-1,s}$ and $\theta^{*}_{rs} = \theta_{rs} - \theta_{r-1,s}$ $\forall s$. Thus, the comparison of adjacent categories equals the
estimation of binary logits. In order to utilize the information of the ordinal alignment
of the categories, a score $\phi_{rs}$ for each category r is introduced, so that
$\theta_{rs} = -\phi_{rs} \theta_s \;\forall r, s$ with $\phi_{1s} > \phi_{2s} > \cdots > \phi_{Rs}$ (Anderson (1984)). Consequently, the probability
of choosing the category r is:

$$P(y_n = r \mid s, x_n) = \frac{\exp(\theta_{0rs} - x_n' \phi_{rs} \theta_s)}{\sum_{l=1}^{R} \exp(\theta_{0ls} - x_n' \phi_{ls} \theta_s)} \quad (3)$$



Noticeably, the number of parameters to be estimated is highly affected by the num- 
ber of segments S, but increases just by one for each category. 
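
For a single segment, the category probabilities of equation 3 can be evaluated as a softmax over the terms "intercept minus score times linear predictor". The following sketch assumes exactly this form; the function name and the parameter values are hypothetical and serve only to illustrate the role of the ordered scores.

```python
import math


def category_probabilities(intercepts, scores, x, theta):
    """Probabilities of the R categories within one segment.

    intercepts: one intercept per category
    scores:     one score per category, ordered from largest to smallest
    x, theta:   predictor vector and segment-specific parameter vector
    """
    linear = sum(xi * ti for xi, ti in zip(x, theta))        # x' theta_s
    logits = [b0 - phi * linear for b0, phi in zip(intercepts, scores)]
    peak = max(logits)                                       # numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical segment with three categories and one predictor.
probs = category_probabilities(
    intercepts=[0.0, 0.0, 0.0], scores=[1.0, 0.5, 0.0], x=[1.0], theta=[2.0]
)
print([round(p, 3) for p in probs])
```

With decreasing scores and a positive linear predictor, higher categories receive higher probability, which is how the scores carry the ordinal information.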



3.2 Criteria for deciding on the number of clusters 

To determine the optimal number of clusters from the structure of a given data set, 
distortion-based methods, such as the GAP-criterion or the Jump-criterion, have been 
found to be efficient for revealing the correct number (see Wagner et al. (2005) for 
a detailed discussion). In contrast to partitioning cluster algorithms, the fitting of 
response models usually involves the maximization of the likelihood function. 

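$$\ln L = \sum_{n=1}^{N} \sum_{s=1}^{S} z_{ns} \left( \ln f_s(y_n \mid s, x_n, \theta_s) + \ln \pi_{ns}(k = s \mid \tilde{x}_n, \vartheta) \right) \quad (4)$$

with

$$z_{ns} = \begin{cases} 1, & \text{if observation } n \text{ is in segment } s, \\ 0, & \text{otherwise} \end{cases} \qquad (n = 1, \ldots, N;\; s = 1, \ldots, S)$$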



Using this likelihood, the optimal number of clusters can be determined by minimiz- 
ing the Akaike Information Criterion: 



$$\mathrm{AIC} = -2 \ln L + 2Q \quad \text{with } Q \ldots \text{number of parameters} \quad (5)$$



Systematic evaluations of competing criteria revealed that the modified Akaike In- 
formation Criterion, 

$$\mathrm{MAIC} = -2 \ln L + 3Q \quad (6)$$

outperforms AIC as well as other criteria such as BIC or CAIC (Andrews and
Currim (2003)).

4 Empirical application

4.1 Data description and preprocessing 

The data were gathered using the standardized questionnaires developed within the
CMP project. The first sample of $n_1 = 32$ observations was generated in the course
of postgraduate management training in Moscow. A second sample of $n_2 = 40$ observations
was gathered in cooperation with the European Bank for Reconstruction and
Development. This sample differs from the first one because it includes organizations
based in St. Petersburg and in Yaroslavl, a smaller town located 250 km north-east
of Moscow. A third sample of $n_3 = 28$ observations was gathered along the lines of
the first sample in the course of postgraduate management training, but in Lithuania,
covering organizations based in the capital, Vilnius, and the city of Kaunas. Because
of this particular structure of the data under consideration, we expect the data set
to comprise observations from, statistically speaking, different generating processes.
Thus, the mixture approach outlined in section 3.1 should fit the data better than
simple approaches.

Each of the five marketing concepts (as depicted in subsection 2.1) is measured
using nine Likert-type scaled items describing the nature of buyer-seller relations,
the managerial intention, the spending of marketing budgets and the type of staff
engaged in marketing activities (cf. Wagner (2005) for details). Following the
procedure of Coviello et al. (2002), a factor score is computed from each of the nine sets
of indicators. These make up the vectors of predictors $x_n$. As outlined in section 2,
two major reasons for introducing the new relational marketing concepts, DM, EM,
IM, and NM, are the increasing importance of service marketing and the differences
between industrial and consumer markets. Therefore, two binary variables indicating
whether the organization $n$ offers services and serves industrial markets are included
as covariates $x_n$ according to equation 1. Additionally, the questionnaires comprise
a variable capturing an ordinal rating of the organization’s overall commitment to
transactional marketing. This is the endogenous variable, $y_n$, of the response model.
The approach allows quantifying combinations of marketing concepts as
well as revealing substitutive relations.

4.2 Results 

Table 1 depicts the results of fitting the model for 1 to 5 segments. 



Table 1. Model’s fit with different numbers of segments

S    logL       AIC      MAIC     Class. Err.   pseudo R²
1    -131.49    280.97   289.97   .00           .23
2    -117.04    276.09   297.09   .09           .81
3    -100.96    267.92   300.92   .08           .92
4     -89.76    269.52   314.52   .11           .92
5     -82.62    279.24   336.24   .11           .95

It is obvious from the table that the optimal number of segments according to the
AIC and pseudo R² is 3. However, neither the MAIC nor other criteria (e.g., BIC)
confirm this advice. This result is surprising with respect to the
discussion in subsection 3.2 (see Andrews and Currim (2003) for an explanation of
the impact of the data scenario). Table 2 provides the parameter estimates for the predictors
and the covariates of the response model.
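The comparison of the two criteria can be retraced from the reported log-likelihoods. The following is a minimal sketch, not the authors' code: the number of parameters Q is not reported in the text and is back-calculated here from the AIC column of Table 1 (an assumption, giving Q = 9 + 12(S - 1)); the function names are illustrative only.

```python
# Log-likelihoods as reported in Table 1.
log_l = {1: -131.49, 2: -117.04, 3: -100.96, 4: -89.76, 5: -82.62}


def n_params(s):
    """Assumed parameter count, back-calculated from AIC = -2 ln L + 2Q."""
    return 9 + 12 * (s - 1)


def aic(ll, q):
    """Akaike Information Criterion, equation (5)."""
    return -2.0 * ll + 2.0 * q


def maic(ll, q):
    """Modified AIC with penalty factor 3, equation (6)."""
    return -2.0 * ll + 3.0 * q


aic_by_s = {s: aic(ll, n_params(s)) for s, ll in log_l.items()}
maic_by_s = {s: maic(ll, n_params(s)) for s, ll in log_l.items()}

# The criterion is minimized over the number of segments S.
best_aic = min(aic_by_s, key=aic_by_s.get)
best_maic = min(maic_by_s, key=maic_by_s.get)
print(best_aic, best_maic)
```

Under these assumptions the AIC is minimized at three segments, while the heavier penalty of the MAIC favours the one-segment solution, which is the disagreement described above.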



Table 2. Parameter estimates for predictors and covariates

                         Segment 1   Segment 2   Segment 3   Wald-Statistic
inner segment R²            .87         .81         .89          -
$\theta_{01s}$           -17.01       -2.63       -7.68        19.25
$\theta_{02s}$            -7.96        6.51        1.97        biased
$\theta_{03s}$             5.45        2.76        7.82        biased
$\theta_{04s}$             8.93        2.10        2.59        biased
$\theta_{05s}$            10.59       -8.75       -4.70        biased
TM score                   3.67        4.05         .88        10.29
DM score                   2.71        1.03       -7.87         8.94
EM score                  -2.44        -.64        1.37         6.62
IM score                   1.15         .23        6.52         5.17
NM score                   -.44         .29        2.99         3.03
Intercept (covariates)      .83        -.82        -.01         6.97
B2B markets                -.32       -1.36        1.68         6.86
services                    .20        1.90       -2.10         8.81


Obviously, the model fits in all three segments. The organizations assigned to
segment 1 are offering goods and services to Russian consumers rather than to busi- 
ness customers. As expected, the score for TM is positively related to the dependent 
variable (overall commitment towards transactional marketing), but the parameters 
for DM and IM are positive as well. Thus, the organizations combine TM with these 
relational marketing elements, but substitute for EM and NM. The organizations in 
segment 2 are offering services to consumer markets. In contrast to Western-type
marketing folklore, the estimated parameter for TM is the highest positive parameter 
for this segment. The organizations in segment 3 are selling industrial goods. In line 
with conventional theory, the TM score is positive, but not substantial. Moreover, 
IM appears to be most important for these organizations and they have the highest 
parameter of all three segments for NM. So far, the results match the theory, but 
interestingly, these organizations refrain from engaging in DM. 

Table 3 provides a comparison of K-means clustering with the classification of
the response model. The number of three clusters in K-means clustering has been
chosen to achieve a grouping comparable to the finite-mixture approach, but is also
confirmed by the GAP-criterion.

Table 3. Comparison with K-means clustering

               Segment 1   Segment 2   Segment 3   Total
Cluster I          27          13          10         50
Cluster II          9           4           1         14
Cluster III        19           8           9         36
Total              55          25          20        100


Surprisingly, the grouping with respect to the response structure differs completely
from the grouping obtained by conventional K-means clustering. In particular, cluster I,
which covers half of all observations, spreads over all segments and, conversely, the
organizations of segment 1 are matched to all three clusters. This result is confirmed by the
projection of the clustering solutions onto a plane of two canonical discriminant axes
depicted in figure 1.
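The lack of agreement between the two groupings can also be summarized in a single number. The sketch below computes the adjusted Rand index from the cross-tabulation in Table 3; this index is not used in the paper and is introduced here only as an illustration of how such a comparison could be quantified.

```python
from math import comb

# Cross-tabulation from Table 3: rows = K-means clusters I-III,
# columns = finite-mixture segments 1-3.
table = [
    [27, 13, 10],
    [9, 4, 1],
    [19, 8, 9],
]


def adjusted_rand_index(counts):
    """Adjusted Rand index of two partitions given their contingency table."""
    n = sum(sum(row) for row in counts)
    pairs_cells = sum(comb(c, 2) for row in counts for c in row)
    pairs_rows = sum(comb(sum(row), 2) for row in counts)
    pairs_cols = sum(comb(sum(col), 2) for col in zip(*counts))
    expected = pairs_rows * pairs_cols / comb(n, 2)   # agreement expected by chance
    maximum = (pairs_rows + pairs_cols) / 2           # upper bound of the index numerator
    return (pairs_cells - expected) / (maximum - expected)


ari = adjusted_rand_index(table)
print(round(ari, 3))
```

A value of 1 would indicate identical partitions and a value around 0 indicates agreement no better than chance. For Table 3 the index is slightly below zero, in line with the observation that the two groupings differ completely.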



[Figure: two panels, K-means (left) and finite-mixture (right); plotting symbols distinguish clusters I to III within samples 1 to 3]

Fig. 1. Canonical discriminant spaces of groupings



The horizontal axis in the left-hand figure accounts for 82.03 % of the data
variance, the vertical axis for the remaining 17.97 % of this solution (Wilks’ Λ = .15,
F-statistic = 21.17). Here, the clusters are well separated and non-overlapping.
Cluster I and cluster III comprise observations from all three samples, while cluster II is
dominated by observations from the Lithuanian sample. In the right-hand figure, the
horizontal axis accounts for 89.63 % of the data variance, the vertical axis for the
remaining 17.97 % of this solution (Wilks’ Λ = .56, F-statistic = 4.40). The
structure of segments is not reproduced by the canonical discriminant analysis, although
the same predictors and covariates were used. All segments are highly overlapping.
Thus, it is argued that the grouping with respect to the response structure offers
new insights into the structure underlying the data, which cannot be revealed by the
clustering methods prevailing in current marketing research publications.

5 Conclusions



The finite-mixture approach introduced in this paper fits a response model and
clusters the observations simultaneously. It reveals a structure underlying the data
that conventional clustering algorithms have been shown unable to recover.
Analyzing the relevance of the new relationship paradigm for
marketing in transition economies (Russia and Lithuania), this study clarifies that
the borderline is not simply described by distinguishing services vs. goods markets
or industrial vs. consumer markets.

The empirical results give reasons for rethinking the relevance of Western marketing
folklore for transition economies, because only the marketing practices of one
group of organizations targeting industrial customers fit the Western-type guidelines.
Thus, this study confirms conjectures drawn from previous studies, but takes advantage
of a more rigorous approach for an interpretation rather than a description of
classification of contemporary marketing practices in transition economies.

Abstract. We derive a vector autoregressive (VAR) representation from the dynamic
dividend discount model to predict stock returns. This valuation approach with time-varying
expected returns is augmented with macroeconomic variables that should explain time variation
in expected returns and cash flows. The VAR is estimated by a Bayesian approach to reduce
some of the statistical problems of earlier studies. This model is applied to forecasting the
returns of a portfolio of large German firms. While the absolute forecasting performance of
the Bayesian vector-autoregressive model (BVAR) is not significantly different from a naive
no-change forecast, the predictions of the BVAR are better than those of alternative time-series
models. When including past stock returns instead of macroeconomic variables, the forecasting
performance becomes superior relative to the naive no-change forecast, especially over longer
horizons.



1 Introduction 

The prediction of asset returns has been a pivotal area of research in financial
economics since the beginning of the last century. For many decades the common academic
belief was that asset prices followed a random walk in the short and in the
long run. In contrast, linear regression studies in the late 1980s and 1990s and more
recent extensions of these studies show a certain degree of predictability (Fama and
French (1988) and Hodrick (1992), Cremers (2002), Avramov and Chordia (2006),
respectively). Predictability, however, does not necessarily imply market inefficiency
(Kaul (1996)). Rather, time-varying expected returns can lead to return predictability
in a risk-averse world that is consistent with rational behavior and efficient markets.
In addition, time-varying expected returns may explain the excess volatility puzzle,
which states that asset prices fluctuate too widely to be rationally explained by
variation in fundamentals. Investors are believed to be relatively less risk averse during
boom periods, thus demanding only a low risk premium, whereas they are relatively
more risk averse during recessions, requiring a higher risk premium. The objective of
this study is to add to our understanding of stock price fluctuations by deriving a
vector autoregressive (VAR) form of the dynamic dividend discount model for
predicting stock returns. This model is augmented with macroeconomic variables in
order to explain the time variation in expected returns and cash flows.

In the next section we review the literature and in section 3 we derive a BVAR 
model from the dynamic dividend discount model in order to explain the conditional 
distribution of returns over time. This approach combines various extensions of the 
early linear regression studies. The empirical results are presented in section 4 and 
the last section concludes the paper. 



2 Literature review 

Many studies have tried to explain expected stock returns with fundamental vari- 
ables. An overview of the literature is provided in Table 1. Most of these studies 



Table 1. Comparison of variables used in linear regressions.

Linear Regressions

                                              Variables
Author(s)                        Sample       1-4        5-6    7-8    9-10   11-13
Chen et al. (1986)               1953-1983    X          X      X X    X X    X X
Campbell (1987)                  1959-1983    X          -      X X    X      -
Harvey (1989)                    1941-1987    X X        X      X      X X    X
Ferson (1990)                    1947-1985    X          -      X X    X      -
Ferson and Harvey (1991)         1959-1986    X X        X      X      X X    X
Ferson and Harvey (1993)         1970-1989    X          -      X      -      X X
Whitelaw (1994)                  1953-1989    X          X      X      X      -
Pesaran and Timmermann (1995)    1954-1992    X X        -      X X    X      X X
Pontiff and Schall (1998)        1926-1994    X          X      X      X      -
Ferson and Harvey (1999)         1963-1994    X          X      X      X X    -
Bossaerts and Hillion (1999)     1956-1995    X X X X    -      X      X      X X
Cremers (2002)                   1994-1998    X X X      X X    X X    X X    X X X

Reproduced from Cremers (2002, p. 1226). For references see Cremers (2002). 



Variables: 1 - lagged return; 2 - dividend yield; 3 - P/E-ratio; 4 - payout ratio; 5 - trading 
volume; 6 - default spread; 7 - yield on T-bill; 8 - change in yield on T-bill; 9 - term spread; 
10 - yield spread between overnight fixed income security and T-bill; 11 - January dummy; 
12 - growth rate of industrial production; 13 - change in inflation or unexpected inflation. 



were able to detect some degree of predictability. In these cases predictability is
usually defined as at least one non-zero coefficient in a regression model using one or
a combination of lagged fundamental variables (Table 1) to predict future returns
(Kaul (1996)). Some of these studies used overlapping observations which may have
resulted in autocorrelated residuals. In addition, a small sample bias could have been
caused by the endogeneity of lagged explanatory variables (Hodrick (1992)). If the
dependent and at least one of the independent variables are non-stationary, the
relationship between expected returns and fundamental variables might be spurious
(Ferson and Sarkissian (2003)). Unfortunately, the small sample bias and a spurious
regression reinforce each other.

To mitigate the problems due to these biases, several modifications have been
suggested in the literature (Hodrick (1992)). In order to correct the inferences we
introduce lags of explanatory variables, which then results in a VAR representation.
Moreover, determining the best predictors is difficult as various studies find different
variables to be good predictors. This observation indicates that data mining might
be at work. More recent studies use Bayesian approaches in order to integrate more
variables and condition inferences on the whole set of potential predictors (Avramov
(2002), Cremers (2002), Avramov and Chordia (2006)). These studies still find
predictability based on both forecasting errors and profitability of investment strategies.

In recent years, Bayesian methods have become more prominent in areas of finance such
as asset pricing, portfolio choice, and performance evaluation of mutual funds. An
important advantage of the Bayesian methods is that they yield the complete
distribution of the model parameters. Thus, estimation risk can be incorporated into asset
allocation decisions. Furthermore, applying Bayesian techniques in asset allocation
decisions leads to much more stable portfolio weights than with classical portfolio
optimization (Barberis (2000)).



3 Model 



3.1 Dynamic dividend discount model 



The forecasting equation is derived from the dividend discount model with time- 
varying expected returns: 



$$P_t = E_t\left[\sum_{i=1}^{\infty} \frac{D_{t+i}}{(1+R_{t+i})^{i}}\right] \qquad (1)$$



As this equation is non-linear, we use the log-linear approximation of Campbell and 
Shiller (1989) and take expectations: 



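$$p_t = \frac{k}{1-\rho} + E_t\left[\sum_{i=0}^{\infty} \rho^{i}\bigl((1-\rho)\,d_{t+1+i} - r_{t+1+i}\bigr)\right] \qquad (2)$$

where $k$ and $\rho$ are constants and lower case letters denote logarithms. Subtracting
equation (2) from the logarithmic dividend results in:

$$d_t - p_t = -\frac{k}{1-\rho} + E_t\left[\sum_{i=0}^{\infty} \rho^{i}\bigl(r_{t+1+i} - \Delta d_{t+1+i}\bigr)\right] \qquad (3)$$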
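The step from the price equation to the dividend-yield equation rests on rewriting the discounted sum of future log dividends in terms of dividend growth. The sketch below checks this rearrangement numerically for a truncated sum. The value of the discount constant, the truncation horizon and the dividend path are hypothetical choices made only for illustration.

```python
import math

rho = 0.96   # hypothetical discount constant, 0 < rho < 1
T = 400      # truncation horizon standing in for the infinite sum

# Hypothetical bounded log-dividend path d_t, d_{t+1}, ..., d_{t+T}.
d = [1.0 + 0.1 * math.sin(0.3 * j) for j in range(T + 1)]

# sum_i rho^i d_{t+1+i} and sum_i rho^i (d_{t+1+i} - d_{t+i}), for i = 0..T-1
weighted_levels = sum(rho ** i * d[1 + i] for i in range(T))
weighted_growth = sum(rho ** i * (d[1 + i] - d[i]) for i in range(T))

# Identity: d_t - (1 - rho) * sum rho^i d_{t+1+i}
#           = - sum rho^i * delta d_{t+1+i} + rho^T * d_{t+T}
lhs = d[0] - (1.0 - rho) * weighted_levels
rhs = -weighted_growth
remainder = rho ** T * d[T]   # vanishes as T grows when d is bounded

print(abs(lhs - (rhs + remainder)) < 1e-9)
```

Because the remainder term shrinks geometrically, the dividend terms in the price equation collapse into discounted dividend growth, which is why only returns and dividend growth appear on the right-hand side of the dividend-yield equation.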



From equation (3) it can be seen that the current (logarithmic) dividend yield is a
good predictor for future expected returns. The intuition behind that equation is that
the dividend yield itself is a stationary variable. Thus, if the variable is above its
long-run mean either the price has to increase, the dividend has to fall or both effects
have to occur at the same time. Because prices usually fluctuate more widely than
dividends, a subsequent price change rather than a dividend change is more likely.

3.2 Bayesian vector autoregressive model

The dynamic dividend discount model can be augmented by macroeconomic vari- 
ables that are able to explain expected returns and cash flows over time. The resulting 
model can be written in a general VAR form: 



$$y_t = \Phi_1 y_{t-1} + \varepsilon_t \qquad (4)$$

The vector $y$ contains returns, dividend yields, and macroeconomic variables. Note
that in general any VAR($p$) model can be stacked and represented as a VAR(1) model
as in equation (4). We introduce prior information into the model by using a Bayesian
approach. This imposes structure on the model in a flexible way and downsizes the 
impact of shocks on our forecasts as shocks do not tend to repeat themselves in 
the same manner in the future. Furthermore, a larger number of variables and lags 
can be included than in the classical case without the threat of overfitting. BVAR 
models were used to predict business cycles which tend to be the main driver of 
our valuation equation (Litterman (1986)). In addition, BVAR models have proved 
to be valuable in other applications such as the prediction of foreign exchange rates 
(Sarantis (2006)). 
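The stacking of a higher-order VAR into first-order form mentioned above can be illustrated with the companion matrix. This is a minimal sketch in plain Python. The coefficient matrices and starting values are hypothetical, and the helper names `companion` and `matvec` are not taken from the paper.

```python
def companion(coef_mats):
    """Stack VAR(p) coefficient matrices (each n x n, ordered by lag) into the
    (n*p) x (n*p) companion matrix of the equivalent VAR(1)."""
    p = len(coef_mats)
    n = len(coef_mats[0])
    size = n * p
    big = [[0.0] * size for _ in range(size)]
    # First block row holds the lag matrices side by side.
    for lag, mat in enumerate(coef_mats):
        for i in range(n):
            for j in range(n):
                big[i][lag * n + j] = mat[i][j]
    # Identity blocks below shift each lagged vector down by one position.
    for k in range(n * (p - 1)):
        big[n + k][k] = 1.0
    return big


def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# Hypothetical VAR(2) with two variables (no intercept, no shock).
A1 = [[0.5, 0.1], [0.0, 0.3]]
A2 = [[0.2, 0.0], [0.1, 0.1]]
y_prev, y_prev2 = [1.0, 2.0], [0.5, -1.0]

# Direct VAR(2) step versus one step of the stacked VAR(1).
direct = [a + b for a, b in zip(matvec(A1, y_prev), matvec(A2, y_prev2))]
stacked = matvec(companion([A1, A2]), y_prev + y_prev2)
```

The first n entries of `stacked` reproduce the direct VAR(2) step and the remaining entries carry the previous observation forward, so iterating the stacked system yields multi-step forecasts.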

In order to keep the definition of the prior manageable, we impose prior means that
suggest a random walk structure for the stock price and three hyperparameters for
the prior precision as suggested by Litterman (1986). This assumes more explanatory
power of own lags than of other variables and decreasing explanatory power
with increasing lag length for each equation in the VAR. In the estimation we follow
Litterman (1986) and use his extension of the mixed-estimation approach.
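To make the role of the three hyperparameters concrete, the sketch below uses one common textbook parameterization of a Litterman-type prior. The paper does not state its exact specification, so the functional form, the parameter names (`tightness`, `cross_weight`, `decay`) and the default values are assumptions for illustration only.

```python
def prior_mean(i, j, lag):
    """Random-walk prior mean: 1 on the own first lag, 0 everywhere else."""
    return 1.0 if (i == j and lag == 1) else 0.0


def prior_std(i, j, lag, sigma, tightness=0.2, cross_weight=0.5, decay=1.0):
    """Prior standard deviation of the coefficient on lag `lag` of variable j
    in equation i (assumed form; sigma holds residual scale estimates)."""
    base = tightness / lag ** decay          # shrinks with increasing lag length
    if i == j:
        return base                          # own lags are shrunk least
    # Other variables are shrunk more and rescaled for their units.
    return base * cross_weight * sigma[i] / sigma[j]


sigma = [1.0, 2.0]                           # hypothetical residual scales
own_lag1 = prior_std(0, 0, 1, sigma)
own_lag2 = prior_std(0, 0, 2, sigma)
cross_lag1 = prior_std(0, 1, 1, sigma)
```

A smaller prior standard deviation corresponds to a higher prior precision, so coefficients on distant lags and on other variables are pulled more strongly towards their prior mean of zero.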



4 Empirical study 

To evaluate our model we compare its forecasting quality with five benchmark model
types. We form an equally-weighted portfolio of ten arbitrarily chosen DAX stocks
for the period from 01:1992 to 01:2005 based on monthly returns. To simulate
a real-time forecasting strategy we use 59 rolling windows of 90 months
to estimate (and in some cases optimize) the model and predict the portfolio returns
for a horizon of one to 15 months out-of-sample. As benchmark models we use (1)
a random walk as naive forecast, (2) an AR(1) model as simple time-series model,
(3) dynamic ARIMA models that are optimized for each rolling window using the
Schwarz-Bayes criterion (differing in the maximum lag length), (4) linear
regressions (static and dynamic versions using a stepwise regression approach), and (5)
classical VAR models.
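The rolling-window design described above can be sketched for the two simplest benchmarks. This is an illustration under stated assumptions, not the authors' procedure: the naive forecast is taken to be the last observed return, the AR(1) is fitted by ordinary least squares within each window, and the toy series is hypothetical.

```python
def fit_ar1(window):
    """OLS estimates (c, phi) of r_t = c + phi * r_{t-1} + e_t."""
    x, y = window[:-1], window[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    phi = sxy / sxx if sxx > 0 else 0.0
    return my - phi * mx, phi


def rolling_mse(returns, window_len, horizon):
    """Mean squared h-step errors of the naive and AR(1) forecasts."""
    se_naive, se_ar1 = [], []
    last_start = len(returns) - window_len - horizon
    for start in range(last_start + 1):
        window = returns[start:start + window_len]
        target = returns[start + window_len + horizon - 1]
        naive = window[-1]                   # no-change forecast (assumption)
        c, phi = fit_ar1(window)
        forecast = window[-1]
        for _ in range(horizon):             # iterate the AR(1) h steps ahead
            forecast = c + phi * forecast
        se_naive.append((target - naive) ** 2)
        se_ar1.append((target - forecast) ** 2)
    return sum(se_naive) / len(se_naive), sum(se_ar1) / len(se_ar1)


# Hypothetical noise-free AR(1) series, used only to exercise the functions.
series = [10.0]
for _ in range(19):
    series.append(1.0 + 0.5 * series[-1])

mse_naive, mse_ar1 = rolling_mse(series, window_len=10, horizon=2)
```

On real return data the same loop would be run once per forecasting horizon, and the resulting MSE values of each model compared with those of the naive forecast.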

We employ the dividend yield of the portfolio and a total of 30 different
macroeconomic variables such as interest rates, sentiment indicators, implied volatilities and
foreign exchange rates in order to construct 28 different model specifications of our
model types. As the results are comparable to our general findings, we focus on the
models using the dividend yield, the change in GDP and the change in the
unemployment rate as predictors. To judge the quality of our forecasts, we apply a direct
approach in that we use squared forecasting errors as well as mean squared errors
(MSE).

4.1 1-step forecasts 

First, we replicate the results from previous studies that found predictability based
on the significance of the predictor variables. As presented in Figure 1, the dividend
yield is significant for almost 60 % of the sample and the change in GDP is significant
for the entire sample. The change in the unemployment rate becomes insignificant
after the first rolling window. The time-varying pattern of the t-values and the
coefficients of the predictors (not reported) are in line with Bessler and Opfer (2004).
Thus, model uncertainty is not a static problem but rather needs to be taken into
account over time as well. A comparison of squared errors over time for the six models
reveals that the AR(1) and the BVAR yield the best results (Figure 2). All models
show a similar pattern with peaks occurring in those months when returns are both
very high in magnitude and negative most of the time. Thus, such extreme returns
cannot be predicted with these models. By taking a closer look at the two models with
the best forecasting performance, i.e. the AR(1) and the BVAR, two issues should
be noted. The difference in squared errors between the AR(1) and the BVAR in the
upper panel of Figure 3 reveals that the BVAR is superior in normal markets.
Nevertheless, the AR(1) performs better in down markets. This can be explained by the
increasing (positive) autocorrelation in returns during market turmoil. The change in
autocorrelation is not reflected in the BVAR as its forecasts do not respond to shocks
as quickly as the AR(1) forecasts. However, a comparison of the MSE for the six
models indicates that none of the models produces significantly better forecasts than
a naive forecast.

Fig. 1. Comparison of t-values of linear regression over time

Fig. 2. Squared forecasting errors for 1-step ahead forecasts over time

Fig. 3. Comparison of squared forecasting errors for BVAR and random walk



4.2 1- to 15-step forecasts 

By looking at longer forecasting horizons of up to 15 months, the dominance of the
BVAR becomes even more pronounced (Figure 4). However, the simple AR(1) still
provides comparable results.

4.3 Single stocks as variables 

An interesting result emerges when we substitute the macroeconomic variables in
the BVAR with the return series of the ten stocks of the portfolio. The forecasting
performance of the BVAR based on single stock returns improves significantly. For
example, over a forecasting horizon of 12 months the MSE of the BVAR is about
3 percentage points smaller than the MSE of a naive forecast. The superior results
based on single stocks rather than macroeconomic variables cannot be explained
by a decoupling of returns from macroeconomic factors during the rise and fall of
the new economy era. The MSE for the subsample including the downturn up to
03:2003 is only about 1 percentage point smaller than the naive forecast’s MSE over
all forecasting horizons. In contrast, for the subsample from 04:2003 onwards, the
MSE is at least 2 percentage points smaller than that of a naive forecast and again
the lowest for a 12-month horizon (6.5 percentage points smaller), which implies a
high degree of predictability.

[Figure panels: Random Walk (#14), AR(1)-Model (#13), Box Jenkins (#10); horizontal axis: forecasting horizon]

Fig. 4. Squared forecasting errors for 1- to 15-step ahead forecasts



5 Conclusion and outlook 

The objective of this study was to evaluate the forecasting performance of BVAR
models for stock returns relative to five benchmark models. Our results suggest that
even if we can reproduce the predictability results of earlier studies based on the
significance of parameters, none of the models based on macroeconomic variables
is capable of predicting stock returns as measured by forecasting errors. However,
there is a certain degree of predictability of the BVAR when we use the returns of
single stocks instead of macroeconomic variables. Thus, it seems worthwhile to take
a closer look at the cross-correlation structure of stock returns over monthly horizons.
For future studies we suggest using asymmetric weighting matrices for the prior
that take into account the differences between industries (cyclical vs. non-cyclical)
and sizes of the companies. Alternatively, our methodology could be extended to an
application on bond markets as it is derived from a simple present-value relation.

Abstract. In this paper we investigate the consistency of the certification model against
the adverse selection model with respect to the operational performance of venture-backed
(VB) IPOs. We analyse a set of economic-financial variables for a sample of Italian IPOs between
1995 and 2004. After non-parametric tests, to take into account the non-normal, multivariate
nature of the problem, we propose a non-parametric regression model, i.e. Partial Least
Squares, as an appropriate investigative tool.



1 Introduction 

In the financial literature, the performance evaluation of venture backed IPOs has
stimulated an important debate. There are two main theoretical approaches. The first one
points out the certification role and the value added services of venture
capitalists. The second one emphasizes the negative effects of adverse selection and
opportunistic behaviours on IPO under-performance, especially with respect to the
timing of the IPOs.

In different studies (Wang et al., 2003; Brau et al., 2004; Del Colle et al., 2006)
parametric tests and Ordinary Least Squares regression have been proposed as
investigative tools. In this work we investigate the complicated effects of adverse selection
and conflicts of interest with non-parametric statistical approaches. Given the
non-normal data distribution, we propose non-parametric tests and the
Partial Least Squares regression model (PLS; Tenenhaus, 1998; Durand,
2001) as appropriate instruments. First, we test whether the differences in operational
performance between the pre-IPO and post-IPO samples are significant. Next, given the
complicated multivariate nature of the problem, we study the dependence of firm
performance (measured by ROE) on quantitative and qualitative context variables
(such as market conditions).

2 The theoretical financial background: the certification model
and the adverse selection model

The common denominator of theoretical approaches to the role of the venture capitalist
is the management of asymmetric information. On the one hand, the certification
model considers an efficient solution of this question, due to the scouting process and
activism of private equity agents. More specifically, the certification model takes into
account the selection capacity and the monitoring function of venture capitalists, which
allow better resource allocation and better control systems than other financial
solutions (Barry et al., 1990; Sahlman, 1990; Megginson and Weiss, 1991; Jain and
Kini, 1995; Rajan and Zingales, 2004). Consequently, this model predicts good
performance of venture backed firms, even better than that of non backed ones. The causes
of this effect ought to be: more stable corporate relations; strict covenants; frequent
operational control activities; board participation; stage financing options. These aspects
should compensate for the incomplete contractual structure that accompanies every
transaction, allowing a more efficient management of the asymmetric information
problem. So, venture backed IPOs should generate good performance in terms of
growth, profitability and financial robustness, even better when compared with
non backed ones.

On the other hand, IPO under-performance could be related to adverse selection
processes, even if a venture capitalist holds a stake in these companies. In this case
two related aspects should be considered. The first one is that venture capital agents
do not necessarily select the best firms. The second one is that the timing of the
IPO may not coincide with a new cycle of growth or with an increase in profitability.
Regarding the first matter, some factors could create a disincentive to accept
the entry of venture capital, such as latent costs, loss of control rights and income
sharing. At the same time, the listing option may not send an efficient signal
to the market. According to the pecking order theory, the IPO choice can be
neglected or rejected altogether by firms that are capable of creating value by
themselves, without the financial support of a fund or the stock exchange. First, low
quality companies could receive more incentives to list if the value assigned
by the market exceeds inside expectations, especially during bubble periods
(Benninga, 2005; Coakley et al., 2004). In this situation, the venture capitalist could adopt
an insider approach too, for example by stimulating an early IPO, as described by
the grandstanding model (Gompers, 1996; Lee and Wahal, 2004). Second, venture
capitalists could have a conflict of interest with the market when they have to
accelerate capital turnover. This is a major issue if the venture capitalist operates
as an intermediary of resources obtained during the fund raising process. In this
case, the venture capitalist assumes a double role: he is a principal with respect to
the target company, but an agent with respect to the fund, which results in a more
complex, onerous and therefore less efficient agency nexus. So the hypothesis is
that an inefficient management of asymmetric information can also explain the
under-performance of VB IPOs, refuting the assumption of superior results compared
to non-VB IPOs (Wang et al., 2003; Brau et al., 2004). The opportunistic
behaviours of previous shareholders might not be moderated by the venture capitalist’s
governance solutions. Furthermore, venture capitalists could even encourage a
speculative approach to maximize and bring forward the exit from low quality companies,
dimming their hypothetical certification function.

Table 1. Wilcoxon Signed Rank Test in VB IPOs: Test1 = H0: $Me_{T1} = Me_{T2}$;
Test2 = H0: $Me_{T1} = Me_{T3}$; Test3 = H0: $Me_{T2} = Me_{T3}$

Ratios      Me_T1     Me_T2    Me_T3     Test1      Test2     Test3
ROS           9.97     7.34     5.39     -0.87      -1.78*    -1.66*
ROE           9.75     6.84    -1.51     -1.16      -1.91*    -1.66**
ROI           7.33     6.69     3.30     -1.35      -1.79*    -1.73*
Leverage    292.67    79.75   226.96     -3.29***   -0.09     -2.54***

Table 2. Mann-Whitney Test comparison in VB IPOs: Test1 = H0: $Me_{VB,T2} = Me_{NonVB,T2}$;
Test2 = H0: $Me_{VB,T3} = Me_{NonVB,T3}$

Ratios      VB_T2    NonVB_T2    VB_T3    NonVB_T3    Test1    Test2
ROS          9.52      3.93       5.39      2.16       103      116
ROE          6.7       3.83       3.3       2.01       111      110
ROI          6.85      3.83      -1.51      1.5        113      105
Leverage    79.75     72.52     226.96     88.28       120       58**



3 Data set and non-parametric hypothesis tests 

The study of the Italian venture backed IPOs is based on a sample of 17 companies
listed from 1995 to 2004. The universe consists of 28 manufacturing companies
that have gone public after the entry of a formal venture capitalist with a minority
participation. In addition to the principal group, we have composed a control sample
of non-venture backed IPOs comparable by industry and size. The performance analysis
is based on balance sheet ratios. In particular, the study takes profitability and
financial robustness as the main parameters to evaluate operational competitiveness
before and after the quotation. Ratios refer to three critical moments, or terms of
the factor, called events: the deal year (T1), the IPO year (T2) and the first year
post-IPO (T3). First we test the performance differences of balance sheet ratios
within the venture backed IPOs with respect to the three events (T1, T2, T3). Next we
test for significant differences between the two independent samples of VB IPOs and
non-VB IPOs. Given the particular sample characteristics (non-normal distribution and
heteroscedasticity) we use non-parametric tests: the Wilcoxon signed rank test
(Wilcoxon and Wilcox, 1964) for paired dependent observations and the Mann-Whitney
test (Mann and Whitney, 1947) for comparisons of independent samples. Consistently
with the adverse selection model, we test whether the venture backed companies show
an operational underperformance between the pre-IPO and post-IPO phases.
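
As an illustration of the paired test, the following minimal sketch computes the
normal-approximation z statistic of the Wilcoxon signed rank test in plain Python.
The function name and the sample values are hypothetical, zero differences are
dropped, and no tie correction is applied to the variance; these are assumptions of
the sketch, not details taken from the study.

```python
import math


def wilcoxon_signed_rank_z(before, after):
    """Normal-approximation z statistic of the Wilcoxon signed rank test."""
    # Paired differences; zero differences carry no sign and are dropped (assumption).
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    # Sum of the ranks of the positive differences, standardized under H0.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / sd


# Hypothetical ROE values of five firms at T1 and T3 (not data from the study).
roe_t1 = [10.0, 9.0, 8.0, 7.0, 6.5]
roe_t3 = [9.0, 7.0, 5.0, 3.0, 1.0]
print(wilcoxon_signed_rank_z(roe_t1, roe_t3))
```

A negative z indicates that the later values tend to be lower than the earlier ones,
which is the direction of the statistics reported in table 1.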

Subsequently, consistently with the certification model, we test whether the venture
backed companies perform better than non-venture backed IPOs. The statistics of VB
IPOs show an underperformance trend of venture backed companies over the three
defined terms. In particular, all the profitability ratios decline steadily.
Moreover, we find a high level of leverage (Debt/Equity) at the deal moment, and in
the first year post-IPO the financial robustness deteriorates again very rapidly. So
the predicted re-balancing effect on financial structure is confirmed only with
respect to the IPO event (see table 1). The results of the Wilcoxon signed rank test
are reported in table 1. The null hypothesis is not rejected for the profitability
parameters when comparing the ratio medians of T1 and T2, whereas the differences
between the ratio medians of T1 and T3 and of T2 and T3 are significant (significance
levels are marked by the symbols *=10%, **=5%, ***=1%). So the profitability
breakdown is mainly a post-IPO problem, with a negative effect of leverage. These
results suggest that venture capitalists do not add value in the post-IPO period;
rather, adverse selection weakens the certification function and the best practice
effects expected from venture capital solutions.
Furthermore we test the hypothesis that VB IPOs generate superior operating
performance compared with non-venture IPOs. Using the Mann-Whitney test, we compare
the IPO ratios of the two independent samples. The findings show no significant
difference between the samples at the IPO time and in the first year post-IPO; only
the leverage level shows a higher growth in the venture group than in the non-venture
one, confirming the contraction of financial robustness and the loss of the
re-balancing effect on financial structure produced by the IPOs (see table 2). In
conclusion, the test results are more consistent with the adverse selection theory.
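
For the comparison of two independent samples, the Mann-Whitney U statistic can be
sketched as follows. This is a minimal plain-Python illustration with hypothetical
names and data; ties between the samples are counted as one half, and the sketch does
not say which of the two U values a given table reports.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0      # a beats b
            elif a == b:
                u_a += 0.5      # a tie counts one half for each sample
    # The two statistics always add up to the number of pairs.
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Hypothetical leverage ratios of VB and non-VB firms (not data from the study).
vb = [226.0, 180.5, 310.2, 95.0]
non_vb = [88.0, 70.3, 120.9]
print(mann_whitney_u(vb, non_vb))
```

The smaller of the two values is usually compared with tabulated critical values, or
standardized for larger samples.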

Underlining the multivariate, non-normal nature of the problem, after hypothesis 
tests, we propose to investigate VB performance by a suitable non-parametric re- 
gression model. 



4 Multivariate investigation tools: Partial Least Squares regression model

In the presence of a low ratio of observations to variables and in case of
multicollinearity in the predictors, a natural extension of multiple linear
regression is the PLS regression model. It has been promoted in the chemometrics
literature as an alternative to ordinary least squares (OLS) in poorly conditioned or
ill-conditioned problems (Tenenhaus, 1998). Let Y be the categorical n,q response
matrix and X the n,p matrix of the predictors observed on the same n statistical
units. The resulting transformed predictors are called latent structures or latent
variables. In particular, PLS chooses the latent variables as a series of orthogonal
linear combinations (under a suitable constraint) that have maximal covariance with
linear combinations of Y. PLS constructs a sequence of centered and uncorrelated
exploratory variables, i.e. the PLS (latent) components $(t_1, \dots, t_A)$. Let
$E_0 = X$ and $F_0 = Y$ be the design and response data matrices, respectively.
Define $t_k = E_{k-1} w_k$ and $u_k = F_{k-1} c_k$, where the weighting unit vectors
$w_k$ and $c_k$ are computed by maximizing the covariance between linear compromises
of the updated predictor and response variables, $\max[\mathrm{cov}(t_k, u_k)]$.
Update the new variables $E_k$ and $F_k$ as the residuals of the least-squares
regression on the component previously computed.

The number A of the retained latent variables, also called the model dimension, is 
usually estimated by cross-validation (CV). 
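
The construction of the first latent component can be sketched for the special case
of a single response (q = 1) and model dimension A = 1. In that case the unit weight
vector that maximizes the covariance is proportional to the covariances between each
centered predictor and the centered response. The function names and the toy data are
assumptions made for illustration only.

```python
import math


def center(column):
    """Subtract the mean from a list of numbers."""
    mean = sum(column) / len(column)
    return [v - mean for v in column]


def pls1_first_component(x_columns, y):
    """One-component PLS regression for a single response.

    x_columns is a list of p predictor columns, each of length n.
    Returns (w, t, c, beta), all expressed for the centered data.
    """
    e0 = [center(col) for col in x_columns]   # E_0: centered predictors
    f0 = center(y)                            # F_0: centered response
    # Unit weight vector w_1, proportional to cov(x_j, y) for each predictor j.
    cov = [sum(a * b for a, b in zip(col, f0)) for col in e0]
    norm = math.sqrt(sum(v * v for v in cov))
    w = [v / norm for v in cov]
    # Latent component t_1 = E_0 w_1.
    n = len(f0)
    t = [sum(w[j] * e0[j][i] for j in range(len(e0))) for i in range(n)]
    # Least-squares regression of the response on t_1.
    c = sum(a * b for a, b in zip(t, f0)) / sum(a * a for a in t)
    # Coefficients on the centered predictors: fitted response = E_0 (w_1 c).
    beta = [wj * c for wj in w]
    return w, t, c, beta


# Toy data: two predictors observed on four units (not data from the study).
w, t, c, beta = pls1_first_component(
    [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]], [2.0, 4.0, 6.0, 8.0]
)
print(beta)
```

Further components would be obtained by repeating the same steps on the residuals of
the predictors and of the response, which the sketch omits.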

Two particular properties make PLS attractive and establish a link between
geometrical data analysis and the usual regression. First, when A = rank X,

$\mathrm{PLS}(X, Y) = \{\mathrm{OLS}(X, Y^j)\}_{j=1,\dots,q}$,

if the OLS regression exists.

Second, the principal component analysis, PCA, of X can be viewed as the "self-PLS"
regression of X onto itself,

$\mathrm{PLS}(X, Y = X) = \mathrm{PCA}(X)$.

The PLS regression model has the following properties: it is efficient in spite of a
low ratio of observations to the column dimension of X; efficient in the
multi-collinear context for predictors (concurvity); and robust against extreme
values of predictors (local polynomials). The PLS regression model examines the
predictors of ROE at the IPO year (T2) as performance variables of VB IPO companies.
The predictor variables are one quantitative variable (the leverage measured in the
year of the venture capital entry, LEVERAGET1) and four qualitative variables: 1) the
short time interval between the deal and the IPO time (1 year by-deal, 1Ybydeal;
2 year by-deal, 2Ybydeal); 2) the size of the listed companies (SME; Large); 3) the
trend of the Milan Stock Exchange (Hot Market, HOTmkt; Normal Market, NORMmkt);
4) the origin of the fund (Bank Fund; non-Bank Fund, N-Bank Fund). The model-building
stage consists of finding a balance between goodness of fit and prediction on the one
hand and parsimony on the other. The goodness of fit is measured by $R^2(A)$, which
in our study equals 60%, and parsimony by the PRESS criterion; the dimension
suggested by PRESS is A = 1. By PLS regression we want to verify the effects of some
variables which could underlie opportunistic approaches. Moreover, the analysis
concentrates on the effect of independent variables that could allow the recognition
of a conflict of interests between venture agents and the new stockholders. The
importance of each predictor for the response is evaluated by looking at the
regression coefficients ($\beta$), whose graphical representation is given in
figure 1. For example, the regression coefficient of leverage at the deal time marks
it as a predictor of under-performance in the IPO year
($\beta_{LEVERAGET1} = -0.36$). This finding is consistent with the assumption that
adverse selection at the deal shows its effects when the target firm is listed,
especially when the gap between these two moments is very short. We could also say
that firms performing poorly before the IPO continue to perform poorly afterwards.

Concerning the qualitative predictors, the time interval ($\beta_{1Ybydeal} = -0.17$)
and the firm size ($\beta_{SME} = -0.17$) are useful variables to capture the
influence of a too early quotation, similarly to the grandstanding approach. The
market trend ($\beta_{HOTmkt} = -0.13$) is useful to verify the impact of a
speculative bubble on IPO performance. Furthermore, the origin of the fund
($\beta_{FundBank} = -0.17$) is needed to evaluate the potential conflict of interest
of an agent that covers a double role: banking and venture financing. All these
variables summarize the risk of an adverse selection process and of a speculative
approach that can counteract the certification function of venture capitalist
investments.

Fig. 1. Decreasing Influence of Qualitative and Quantitative Predictors on ROE-T2.

So, in the first place, the leverage reached after the entry of the venture
capitalist is the most negative predictor of ROE at IPO time. In the second place,
the shorter the time interval between the deal and the IPO, the worse the influence
on ROE. In the third place, the firm size SME is a relevant predictor too. In fact,
smaller and less structured enterprises have a negative incidence on IPO operating
performance. In the fourth place, even the market trend seems to play a significant
role in explaining the VB IPO under-performance. More specifically, hot issues
(HOTmkt) have a negative effect on ROE. Finally, in a less relevant position there is
the fund origin Fund Bank; for this variable the theoretical assumption is confirmed
too, because of the negative influence of bank based agents. In synthesis we can say
that ROE under-performance depends on the following predictors: LEVERAGET1, 1Ybydeal,
HOTmkt, SME. So, consistently with the inferential tests, the PLS findings related to
the IPO segment of the Italian Private Equity Market move the venture finance
solution away from the theoretical certification function.



5 Conclusion

The results of the non-parametric tests as well as the more complete multivariate
dependence model show that the operational performances of VB IPOs are significantly
consistent with the adverse selection and opportunistic model. Specifically, a large
part of the IPO under-performance is due to the leverage "abuse" at the deal time,
and the PLS regression shows that a too early quotation after the deal, hot issues
and small firm size are all predictors of falls in profitability. Probably we should
rethink a "romantic" vision of the venture capitalist role: sometimes he is simply an
agent with a conflict of interest, or he does not always have the skill to select the
best firms for the financial market. Obviously there are many implications for
further research and developments of this work. An international comparison with
other financial systems and a further supply and demand analysis ought to be carried
out.



Abstract. Owing to the huge size of the credit markets, even small improvements in
classification accuracy might considerably reduce effective misclassification costs
experienced by banks. Support vector machines (SVM) are useful classification methods
for credit client scoring. However, the urgent need to further boost classification
performance as well as the stability of results in applications leads the machine
learning community into developing SVM with multiple kernels and many other combined
approaches. Using a data set from a German bank, we first examine the effects of
combining a large number of base SVM on classification performance and robustness.
The base models are trained on different sets of reduced client characteristics and
may also use different kernels. Furthermore, using censored outputs of multiple SVM
models leads to more reliable predictions in most cases. But there also remains a
credit client subset that seems to be unpredictable. We show that in unbalanced data
sets, most common in credit scoring, some minor adjustments may overcome this
weakness. We then compare our results to the results obtained earlier with more
traditional, single SVM credit scoring models.



1 Introduction 

Classifier combinations are used in the hope of improving the out-of-sample
classification performance of single base classifiers. It is well known (Duin and Tax
(2000), Kuncheva (2004), Koltchinskii et al. (2004)) that the results of such
combiners can be either better or worse than expensively trained single models and
also that combiners can be superior when used on relatively sparse empirical data. In
general, as the base models are less powerful (and inexpensive to produce), their
combiners tend to yield much better results. However, this advantage decreases with
the quality of the base models (e.g. Duin and Tax (2000)). Our past credit scoring
single-SVM classifiers concentrate on misclassification performance obtainable by
different SVM kernels, different input variable subsets and financial operating
characteristics (Schebesch and Stecking (2005a,b), Stecking and Schebesch (2006),
Schebesch and Stecking (2007)). In credit scoring, classifier combination using such
base models may be very useful indeed, as small improvements in classification
accuracy matter, especially in the case of unbalanced data sets (e.g. with more good
than bad credit clients), and as fusing models on different inputs may be required by
practice. Hence, the paper presents in sections 2 and 3 model combinations with base
models on all available inputs using single classifiers with six different kernels
for unbalanced data sets, and finally in section 4 SVM model combinations of base
models on randomly selected input subsets using the same kernel classifier, placing
some emphasis on correcting overtraining which may also result from model
combinations.



2 SVM models for unbalanced data sets 

The data set used is a sample of 658 clients for a building and loan credit with a
total number of 40 input variables. This sample is drawn from a much larger
population of 17158 credit clients in total. Sample and population do not have the
same share of good and bad credit clients: the majority class is undersampled (drawn
less frequently from the population than the opposite category) to get a more
balanced data set. In our case the good credit clients make up 93.3% of the
population, but only 50.9% of the sample. In the past, a variety of SVM models were
constructed in order to forecast the defaulting behavior of new credit clients, but
without taking into account the sampling bias systematically. For balanced data sets
SVM with six different kernel functions were already evaluated. Detailed information
about kernels, hyperparameters and tuning can be found in Stecking and Schebesch
(2006).

In the case of unbalanced data sets the SVM approach can be described as follows:
Let $f_k(x) = \langle \Phi_k(x), w_k \rangle + b_k$ be the output of the kth SVM
model for unknown pattern $x$, with $b_k$ a constant and $\Phi_k$ the (usually
unknown) feature map which lifts points from the input space $X$ into feature space
$F$, hence $\Phi : X \to F$. The weight vector $w_k$ is defined by
$w_k = \sum_i \alpha_i y_i \Phi_k(x_i)$ with $\alpha_i$ the dual variables
($0 \le \alpha_i \le C(y_i)$), and $y_i$ the binary output of input pattern $x_i$.
For unbalanced data sets the usually unique upper bound $C$ for $\alpha_i$ is
replaced by two output class dependent cost factors $C(-1)$ and $C(+1)$. Different
cost factors penalize, for example, falsely classified bad credit clients more
strongly than falsely classified good credit clients. Note also that
$\langle \Phi_k(x), \Phi_k(x_i) \rangle = K(x, x_i)$, where $K$ is a kernel function,
for example $K(x, x_i) = \exp(-s \|x - x_i\|^2)$, i.e. the well known RBF kernel with
user specified kernel parameter $s$.
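
The kernel form of the SVM output can be illustrated with a short sketch. It
evaluates $f(x) = \sum_i \alpha_i y_i K(x, x_i) + b$ for an RBF kernel and checks the
class dependent box constraints on the dual variables. All names and numbers below
are hypothetical; the dual variables would normally come from an SVM solver, which is
not shown.

```python
import math


def rbf_kernel(x, xi, s):
    """K(x, x_i) = exp(-s * ||x - x_i||^2)."""
    return math.exp(-s * sum((a - b) ** 2 for a, b in zip(x, xi)))


def svm_output(x, support_vectors, labels, alphas, b, s):
    """f(x) = sum_i alpha_i * y_i * K(x, x_i) + b."""
    return sum(
        a * y * rbf_kernel(x, xi, s)
        for a, y, xi in zip(alphas, labels, support_vectors)
    ) + b


def within_class_bounds(labels, alphas, cost):
    """Check 0 <= alpha_i <= C(y_i), where cost maps each class label to its C."""
    return all(0.0 <= a <= cost[y] for a, y in zip(alphas, labels))


# Hypothetical trained model with two support vectors (values are made up).
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
labels = [1, -1]
alphas = [0.2, 0.9]
cost = {1: 0.3, -1: 1.0}   # illustrates C(+1) = 0.3 * C(-1)
print(within_class_bounds(labels, alphas, cost))
print(svm_output([0.2, 0.1], support_vectors, labels, alphas, b=0.1, s=0.05))
```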

Multiple SVM models and combination 

In previous work (Schebesch and Stecking (2005b)) SVM output regions were defined in
the following way: (1) if $|f_k(x)| > 1$, then $x$ is called a typical pattern with
low classification error, (2) if $|f_k(x)| < 1$, then $x$ is a critical pattern with
high classification error. Combining SVM models for classification we calculate
$\mathrm{sign}(\sum_k y^*_k)$ with $y^*_k = +1$ while $f_k(x) > 1$ and $y^*_k = -1$
while $f_k(x) < -1$, which means: SVM model k has zero contribution for its critical
patterns. For illustrative purposes we combine two SVM models (RBF and second degree
polynomial) and mark nine regions (see figure 1): typical/typical regions are I, III,
VII, IX, the critical/critical region is V and typical/critical regions are II, IV,
VI, VIII. Censored classification uses only typical/typical regions (with a
classification error of 10.5 %) and typical/critical regions (where critical
predictions are set to zero) with a classification error of 18.8 %. For the
critical/critical region V no classification is given, as the expected error within
this region would be 39.7 %. For this combination strategy the number of
unpredictable patterns is quite high (360 out of 658). However, by enhancing the
diversity and by increasing the number of SVM models used in combinations, the number
of predictable patterns will also increase.

Fig. 1. Combined predictions of SVM with (i) polynomial kernel
$K(x, x_i) = (\langle x, x_i \rangle + 5)^2$ and (ii) RBF kernel
$K(x, x_i) = \exp(-0.05 \|x - x_i\|^2)$. Black (grey) boxes represent false (correct)
classified credit clients. Nine regions (I-IX) are defined according to the SVM
output of both models.
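
The censored combination rule described above can be sketched in a few lines. Each
model votes +1 or -1 only for its typical patterns and contributes zero otherwise.
The function name is hypothetical, and two choices are assumptions of this sketch:
outputs exactly at +1 or -1 are treated as typical, and a vote sum of zero (all
models critical, or typical votes that cancel) is returned as "no classification".

```python
def censored_vote(outputs):
    """Combine the real-valued outputs f_k(x) of several SVM models for one pattern.

    Returns +1 or -1, or None when no class can be assigned.
    """
    votes = []
    for f in outputs:
        if f >= 1.0:
            votes.append(1)       # typical pattern, positive side
        elif f <= -1.0:
            votes.append(-1)      # typical pattern, negative side
        else:
            votes.append(0)       # critical pattern: zero contribution
    total = sum(votes)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return None                   # unclassified


# Hypothetical outputs of two models (e.g. RBF and polynomial) for three patterns.
for outputs in ([1.4, 0.2], [-1.2, -3.0], [0.5, -0.9]):
    print(outputs, censored_vote(outputs))
```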



3 Multiple SVM for unbalanced data sets in practice 

Table 1 shows the classification results of six single SVM and three multiple SVM
models using tenfold cross validation. Single models are built with the credit
scoring data sample of 658 clients using SVM kernel parameters from Stecking and
Schebesch (2006) and varying cost factors $C(+1) = 0.3 \times C(-1)$ from Schebesch
and Stecking (2005a), allowing for higher classification accuracy towards good credit
clients. The classification results obtained are weighted by class-specific factors
$w$, one for good and one for bad credit clients, which scale each class from its
sample size to its population size, to get an estimation of the "true" population
performance for all models.

Table 1. Tenfold cross validation performance (weighted) for six single models and
three combination models. The g-means metric is used to compare classification
performance of models with unbalanced data sets.

SVM model                  Good       Bad        Bad        Good       Total    g-means
                           rejected   accepted   rejected   accepted            metric
Linear                     1195        619        530       14814      17158    0.653
Sigmoid                       0       1142          7       16009      17158    0.077
Polynomial (2nd degree)      96       1035        114       15913      17158    0.314
Polynomial (3rd degree)    1242        633        516       14767      17158    0.643
RBF                         860        651        498       15149      17158    0.640
Coulomb                       0       1124         25       16009      17158    0.148
M1                          287        850        299       15722      17158    0.505
M2*                         191        331        256       12425      13203    0.655
M3**                       3548        331        818       12425      17158    0.743

*  3955 credit clients could not be classified.
** Default class bad for 3955 credit clients.

For unbalanced data sets, where one class dominates the other, the total error (= sum
of Good rejected and Bad accepted divided by the total number of credit clients) is
not an appropriate measure. The g-means metric has been favored by several
researchers (Akbani et al. (2004), Kubat and Matwin (1997)) instead. If the accuracy
for good and bad cases is measured separately as
$a^+$ = good accepted / (good accepted + good rejected) and
$a^-$ = bad rejected / (bad rejected + bad accepted), respectively, then the
geometric mean of both accuracy measures is $g = \sqrt{a^+ \cdot a^-}$. The g-means
metric tends to favor a balanced accuracy in both classes. It can be seen from
table 1 that in terms of the g-means metric SVM with linear (g = 0.653), polynomial
third degree (g = 0.643) and RBF kernel (g = 0.640) dominate the single models.
Additionally, three multiple models are suggested: M1 (g = 0.505) combines the real
output values of the six single models and takes $\mathrm{sign}(\sum_k f_k(x))$ for
class prediction. M2 (g = 0.655) only uses output values with $|f_k(x)| > 1$ for
combination, leaving 3955 credit clients with $\forall k : |f_k(x)| < 1$ unclassified
(see also critical/critical region V from figure 1). It is strongly suggested to use
refined classification functions (e.g. with more detailed credit client information)
for these cases. Alternatively, as a very simple strategy, one may introduce M3
(default class combination) instead. Default class bad (rejecting all 3955
unclassified credit clients) leads to g = 0.743, which is the highest value among all
models.
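
The g-means metric can be recomputed directly from the four weighted counts of a
table row. The sketch below uses the counts of the linear and sigmoid rows of table 1
as inputs; the function names are chosen here for illustration.

```python
import math


def g_means(good_rejected, bad_accepted, bad_rejected, good_accepted):
    """Geometric mean of the accuracies on good and on bad credit clients."""
    a_plus = good_accepted / (good_accepted + good_rejected)
    a_minus = bad_rejected / (bad_rejected + bad_accepted)
    return math.sqrt(a_plus * a_minus)


def total_error(good_rejected, bad_accepted, bad_rejected, good_accepted):
    """Share of misclassified clients among all classified clients."""
    total = good_rejected + bad_accepted + bad_rejected + good_accepted
    return (good_rejected + bad_accepted) / total


# Rows of table 1: (Good rejected, Bad accepted, Bad rejected, Good accepted).
linear = (1195, 619, 530, 14814)
sigmoid = (0, 1142, 7, 16009)
for name, row in (("linear", linear), ("sigmoid", sigmoid)):
    print(name, round(total_error(*row), 3), round(g_means(*row), 3))
```

The sigmoid model accepts nearly every client, so its total error is lower than that
of the linear model while its g-means value is far smaller, which illustrates why the
total error is a poor guide for unbalanced data.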



4 Combination of SVM on random input subsets

Previous work on input subset selection from our credit client data (Schebesch and
Stecking (2007)) suggests that SVM models using around half or even less of the full
input set can lead to good performance in terms of credit client misclassification
error. This is especially the case when the inputs chosen for the final model are
determined by the rank of their contribution to above average base models from a
population of several hundred models using random input subsets. Now we proceed to
combine the outputs of such reduced input base SVM models. Input subsets are sampled
randomly from m = 40 inputs, leading to subpopulations with base models using
r = 5,6,...,35 input variables respectively. For each r we draw 60 independent input
subsets from the original m inputs, resulting in a population of 31 x 60 = 1860 base
models. These differently informed (or weak) base models are trained and validated by
highly automated model building with a minimum of SVM parameter variation and with
timeout, hence we expect them to be sub-optimal in general. The real valued SVM base
model outputs (see also previous sections) $f_{(rj)}(x) \in \mathbb{R}$ are now
indexed such that $(rj)$ is the $j$th sample with $r$ inputs. These outputs are used
twofold:

• fusing them into simple (additive) combination rules, and 

• using them as inputs to supervised combiners. 
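
The sampling scheme for the base models described above can be sketched as follows.
Each key (r, j) identifies the jth random subset with r inputs. The function name,
the fixed random seed and the zero-based input indices are assumptions made for this
illustration.

```python
import random


def sample_input_subsets(m=40, r_values=range(5, 36), per_size=60, seed=0):
    """Draw per_size random input subsets for every subset size r.

    Returns a dict mapping (r, j) to a sorted list of r distinct input indices.
    """
    rng = random.Random(seed)   # fixed seed only to make the sketch reproducible
    subsets = {}
    for r in r_values:
        for j in range(per_size):
            # Sampling without replacement within one subset; different
            # subsets are drawn independently and may therefore overlap.
            subsets[(r, j)] = sorted(rng.sample(range(m), r))
    return subsets


subsets = sample_input_subsets()
print(len(subsets))        # number of base models to be trained
print(subsets[(5, 0)])     # inputs used by the first model with r = 5
```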

A supervised combiner can itself be an SVM or some other optimization method
(Koltchinskii et al. (2004)), using some $\{f_{(rj)}\}$ as inputs and the original
data labels $y \in \{-1, +1\}^N$ as reference outputs. Potential advantages of random
input subset base model combination are expected to occur for relatively sparse data
and to diminish for large N. As our N = 658 cases are very few in order to reliably
detect any specific nonlinearity in m = 40 input dimensions, our data are clearly
sparse. Using combiners on outputs of weak base models may easily outperform all the
base models but it may also conceal some overtraining. In order to better understand
this issue we evaluate combiners on two sets $S_A$ and $S_B$ of base model outputs.
Set $S_A$ contains the usual output vectors of trained and validated base models when
passed through a prediction sweep over x, i.e. $\{f_{(rj)}\}$ for all $(rj)$. At
first set $S_B$ is initialized to $S_A$. Then it is corrected by the validation sweep
which includes all misclassifications of a model occurring when passed through the
leave-one-out error validation. If $f_{(rj)i}$ is the output of model $(rj)$ at case
$i \in \{1, \dots, N\}$ and if this base model is effectively retrained on data (x,y)
but without case i, then $S_A$ and $S_B$ may differ at entry $(rj)i$. This especially
includes the cases when $f^{A}_{(rj)i} f^{B}_{(rj)i} < 0$ and therefore may also
contain new misclassifications $y_i f^{B}_{(rj)i} < 0$. Hence combiners on subsets of
$S_B$ should lead to a more conservative (stringent) prediction of expected
out-of-sample performance.
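
The difference between the two output sets can be illustrated with a toy base model.
In this sketch the first set holds the outputs of a model trained on all cases, and
the second set holds, for every case, the output of a model retrained without that
case. Replacing every entry by its leave-one-out output is one reading of the
correction described above and is an assumption of the sketch, as is the
one-dimensional centroid classifier that stands in for a base SVM.

```python
def train_centroid_model(xs, ys):
    """Toy stand-in for a base SVM on one input: signed distance to the
    midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    mu_pos = sum(pos) / len(pos)
    mu_neg = sum(neg) / len(neg)
    mid = (mu_pos + mu_neg) / 2
    direction = 1.0 if mu_pos >= mu_neg else -1.0
    return lambda x: direction * (x - mid)


def output_sets(xs, ys, train):
    """Return (s_a, s_b): full-model outputs and leave-one-out outputs."""
    model = train(xs, ys)
    s_a = [model(x) for x in xs]                 # prediction sweep
    s_b = []
    for i in range(len(xs)):                     # validation sweep
        loo_model = train(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        s_b.append(loo_model(xs[i]))             # output for the held-out case
    return s_a, s_b


def count_errors(outputs, ys):
    """Number of cases whose output has the wrong sign."""
    return sum(1 for f, y in zip(outputs, ys) if f * y < 0)


# Hypothetical one-dimensional data; the last case lies close to the boundary.
xs = [0.0, 1.0, 2.0, 3.0, 1.4]
ys = [-1, -1, 1, 1, 1]
s_a, s_b = output_sets(xs, ys, train_centroid_model)
print(count_errors(s_a, ys), count_errors(s_b, ys))
```

In this toy example the borderline case is classified correctly by the full model but
changes sign once it is left out of training, so the second set contains a
misclassification that the first set hides.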
We first inquire into the effects of additive output combination on misclassification
error using both sets $S_A$ and $S_B$. The following simple rules (which have no
tunable parameters) are used: (1) LINCUM, which adds the base model outputs as they
occur during the sweep over input numbers r, i.e. finally combining all the 1860 base
models, (2) I60, which adds the outputs of base models with the same number of inputs
r respectively, producing 31 combiners, and (3) E3, adding the outputs of the three
best (elite) base models from each sorted list of the l-1-o errors. This again leads
to 31 combiners. The experiments shown in fig. 2 (lhs plot) indicate that, when used
on outputs from set $S_A$, these simple combiners consistently outperform the minimum
l-1-o errors from the respective base model subpopulation, with I60 having the best
results on outputs of base models which use small input subsets. However the level of
l-1-o misclassification errors seems to be very low, especially when compared to the
errors obtainable by extensively trained full input set models (i.e. around 23-24%)
from previous work. Hence, in fig. 2 (rhs plot) we report the l-1-o errors for the
same simple combiners when using outputs from set $S_B$. Now the errors clearly shift
upwards, with the bulk of combiners within the error corridor of extensively trained
full input models (E-corridor, the shaded area within the plots). A benefit still
exists as the errors of the combiners remain in general well below (only in some
cases very near to) the minimum l-1-o errors of the respective base model
populations. With increasing r rule LINCUM relies on information increasingly similar
to what a full input set model would see. Hence, it should tend (and for the set
$S_B$ it actually does) towards an error level within the E-corridor.

Fig. 2. Each inset plots the number of inputs r of base models against percent
misclassification error of base models and of combiners. Lhs plot: training and
validation errors of base models and errors of their simple combiners on set $S_A$.
Rhs plot: validation errors of base models (from lhs plot, for comparison) and errors
of simple combiners on set $S_B$. For further explanation see the main text.

Next, the outputs of the base models used by the combiners E3 and I60 are now used by
31 supervised combiners respectively (also SVM models with RBF kernels, for
convenience). These models, denoted by SVM(E3) and SVM(I60), are then trained on
subsets from $S_A$ and subjected to l-1-o error validation (fig. 3, lhs plot).
Compared to the simple combiners they display still further partial improvement, but
with (validated) errors so low for bigger input subsets (r > 16) that this now
appears quite an improbable out-of-sample error range prediction for our credit
scoring data. Training and validating SVM(E3) and SVM(I60) on subsets from $S_B$
instead (fig. 3, rhs plot) remedies the problem. Most of the l-1-o errors are shifted
back into the E-corridor. In this case there is no advantage for SVM(E3) on $S_B$
over simple E3 on $S_B$ and also no improvement for SVM(I60) on $S_B$ for r > 16.
However, for small r SVM(I60) on $S_B$ seems to predict the E-corridor. For somewhat
bigger r this is in fact also the case for simple rule I60 on $S_A$ and for SVM(I60)
on $S_A$. Note that combination procedures with such characteristics can be very
useful for problems with large feature dimensions which contain deeply hidden
redundancy in the data, i.e. which cannot be uncovered by explicit variable selection
(sometimes this is addressed by manifold learning). Censored outputs as described in
the previous sections can be easily included. Also note that there is no tuning and
no need for other complementary procedures like, for instance, combining based on
input or output dissimilarities.

Fig. 3. Axes description same as in fig. 2. Lhs plot: training and validation errors
of supervised combiners on set $S_A$. Simple rule LINCUM from fig. 2 (lhs) for
comparison. Rhs plot: training and validation errors of supervised combiners on set
$S_B$. Simple rule LINCUM from fig. 2 (rhs) for comparison, of the validated base
models. For further explanation see the main text.
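
The three parameter-free combination rules of this section can be sketched as below.
Base model outputs are stored in a dict keyed by (r, j), with one real value per
case. The names, the toy data and the convention of counting a zero output as an
error are assumptions of this illustration; the number of elite models is left as a
parameter, so the value three corresponds to the elite rule used in the text.

```python
def error_rate(outputs, labels):
    """Share of cases whose combined output does not agree in sign with the label."""
    wrong = sum(1 for f, y in zip(outputs, labels) if f * y <= 0)
    return wrong / len(labels)


def add_outputs(models):
    """Additive combination: case-wise sum of the outputs of several models."""
    return [sum(column) for column in zip(*models)]


def simple_combiners(outputs, labels, elite=3):
    """Return (lincum, same_r, elite_r), each a dict mapping r to combined outputs."""
    by_r = {}
    for key in sorted(outputs):
        by_r.setdefault(key[0], []).append(outputs[key])
    lincum, same_r, elite_r = {}, {}, {}
    seen = []
    for r in sorted(by_r):
        group = by_r[r]
        same_r[r] = add_outputs(group)            # all models with r inputs
        best = sorted(group, key=lambda f: error_rate(f, labels))[:elite]
        elite_r[r] = add_outputs(best)            # best models with r inputs
        seen.extend(group)
        lincum[r] = add_outputs(seen)             # all models seen up to r
    return lincum, same_r, elite_r


# Toy outputs of four base models on three cases (not data from the study).
labels = [1, -1, 1]
outputs = {
    (5, 0): [1.0, -1.0, -0.5],
    (5, 1): [0.5, 0.5, 1.0],
    (6, 0): [2.0, -2.0, 2.0],
    (6, 1): [-1.0, 1.0, -1.0],
}
lincum, same_r, elite_r = simple_combiners(outputs, labels, elite=1)
print(error_rate(same_r[5], labels), error_rate(lincum[6], labels))
```

In the toy data each of the two models with five inputs makes one error, while their
sum classifies all three cases correctly, which is the effect the additive rules rely
on.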



5 Conclusions and outlook 

In this paper we examined several combination strategies for SVM. A simple addition
of output values leads to medium performance. By including region information the
classification accuracy (as measured by the g-means metric) rises considerably, while
for a number of credit clients no classification is possible. For unbalanced data
sets we propose to introduce default classes to overcome this problem. Finally,
simple combinations of outputs of weak base SVM classifiers on random input subsets
yield misclassification errors comparable to extensively trained full input set
models and they also seem to improve on them. Training and validating supervised
combiners on the outputs of the base models seems to confirm this result. However,
combiners also tend to overtrain. The more appropriate way of using combiners is to
use base model outputs corrected by the validation sweep. Many of these combiners are
at least as good as the full input set models, even on base model subpopulations
formed by small random input subsets. We suspect that such reduced input model
combiners behave similarly for other data as well, as long as these data still
contain hidden associations between the inputs (which is quite plausible for
empirical data sets).



Abstract. Recommender systems are used by an increasing number of e-commerce websites
to help customers find suitable products in a large database. One of the most popular
techniques for recommender systems is collaborative filtering. Several collaborative
filtering algorithms claim to be able to solve i) the new-item problem, when a new
item is introduced to the system and only a few or no ratings have been provided; and
ii) the user-bias problem, when it is not possible to distinguish two items which
possess the same historical ratings from users but different contents. However, for
most algorithms, evaluations are not satisfying due to the lack of suitable
evaluation metrics and protocols; thus, a fair comparison of the algorithms is not
possible.

In this paper, we introduce new methods and metrics for evaluating the user-bias and new- 
item problem for collaborative filtering algorithms which consider attributes. In addition, we 
conduct empirical analysis and compare the results of existing collaborative filtering algo- 
rithms for these two problems by using several public movie datasets on a common setting. 



1 Introduction 

A recommender system is a type of customization tool in e-commerce that generates
personalized recommendations which match the taste of the users. Collaborative
filtering (CF) (Sarwar et al. (2000, 2001)) is a popular technique used in
recommender systems. It is used to predict the user interest in a given item based on
user profiles. The concept of this technique is that a user who receives a
recommendation for some sort of items would prefer the same items as other
individuals with a similar mindset.

However, besides its simplicity, one of the shortcomings of CF is the new-item or
cold-start problem. If no ratings are given for new items, it is difficult for
standard CF algorithms to determine their own clusters by using rating similarity and
thus they fail to give accurate predictions. Another problem is the user-bias from
historical ratings (Kim and Li (2004)), which occurs when two items, based on
historical ratings, have the same opportunity to be recommended to a user, but
additional information shows that one item belongs to a group which is preferred by
the user and the other does not. For example, as shown in Figure 1, by applying CF,
the probabilities of items 4 and 5 being recommended to user 1 are equal. When the
attributes are also taken into consideration, it can be observed that items 1, 3 and
6, which belong to attribute 1, are rated higher by user 1 than item 2, which belongs
to attribute 2. Thus, user 1 has a preference for items related to attribute 1 over
items related to attribute 2. Consequently, the CF algorithm should assign a higher
probability to item 5, which is more attached to attribute 1, than to item 4, which
is related to attribute 2.

Fig. 1. User-Bias Example

Recommender system algorithms that incorporate attributes claim to solve the
user-bias and the new-item problem; however, no good evaluation techniques exist. For
that reason, in this paper, we make the following contributions: (i) we introduce new
methods and metrics for evaluating these problems and (ii) through a common
experimental setting, we present evaluation results for three existing CF algorithms
which do not take attributes into account, namely user-based CF (Sarwar et al.
(2000)), item-based CF (Sarwar et al. (2001)) and the Gaussian aspect model by
Hofmann (2004), as well as an approach which takes attributes into account, by Kim &
Li (2004). In the next section, we present the related work. In section 3, a brief
description of the aspect model by Hofmann and the approach by Kim & Li is presented.
An introduction of the evaluation techniques for the new-item and the user-bias
problem follows in section 4. Section 5 contains the results of the empirical
evaluations we have conducted, and in section 6 we present the conclusions of the
results and discuss possible future work.



2 Related work

Evaluating CF algorithms is not novel, as relatively standard measures for evaluating
CF algorithms already exist. Most evaluations of CF focus on the overall performance
of the algorithms (Breese et al. (1998), Sarwar et al. (2000), Herlocker et al. (2004)).
However, as mentioned in the previous section, CF suffers from several shortcomings,
namely the new-item problem, also known as the cold-start problem, and the user-bias
problem. It has been claimed that incorporating attributes could help to alleviate
these drawbacks (Kim and Li (2004)). In fact, there exist many approaches for
combining content information with CF (Burke (2002), Melville et al. (2002), Kim and
Li (2004), Tso and Schmidt-Thieme (2005)). However, there has been a lack of suitable
evaluations that provide a comparative analysis of attribute-aware and
non-attribute-aware CF algorithms focusing on these two problems.

Schein et al. (2002) have already discussed methods and metrics for the new-item
problem, for which they introduced a performance metric called the CROC
curve. However, this metric is only suitable for the new-item problem. In this paper,
we use a standard performance metric, but introduce new protocols for evaluating the
new-item and the user-bias problems. Hence, this evaluation setting allows users to
compare the results with standard CF evaluation metrics and is not restricted to
the new-item problem, but also covers the user-bias problem. In addition,
we compare the prediction accuracy of various collaborative filtering algorithms in
this evaluation setting.



3 Observed approaches 

In this section, we present a brief description of the two state-of-the-art CF models: 
the aspect model by Hofmann (2004) and the approach by Kim & Li (2004). 



Aspect model by Hofmann 



Hofmann (2004) specified different versions of the aspect model for the collaborative
filtering domain. In this paper, we focus on the Gaussian model, because
it shows the best prediction accuracy for non-specific problems. He uses the aspect
model to identify the hidden semantic relationship among items y and users u by using
a latent class variable z, which represents the user clusters associated with each
observation pair of a user and an item. In the aspect model, the users and items are
considered as independent from each other and every observation can be described
by a quartet <u, y, v, z>, where v denotes the rating user u has given to item y. For
every observation quartet, the probability is then computed as follows:



$$P(u, y, v, z) = P(v \mid y, z)\, P(z \mid u)\, P(u)$$

The focus of our evaluation in this paper is on the Gaussian pLSA model, in
which $P(v \mid y, z)$ is represented by the Gaussian density function. In the Gaussian pLSA
model, every combination of z and an item y has a location parameter $\mu_{y,z}$ and a scale
parameter $\sigma_{y,z}$. The probability of the rating v is then:



$$P(v \mid y, z) = \frac{1}{\sqrt{2\pi}\, \sigma_{y,z}} \exp\left[ -\frac{(v - \mu_{y,z})^2}{2 \sigma_{y,z}^2} \right]$$



As z is unobserved, Hofmann used the Expectation Maximization (EM) algorithm
to learn the two model parameters: $P(v \mid y, z)$ and $P(z \mid u)$. The EM algorithm has
two main steps. The first step is the computation of the Expectation (E-Step), which is
done by computing the variational distribution Q over the latent variable z. The second
step is the Maximization (M-Step), in which the model parameters are updated by using
the Q distribution computed in the previous E-Step. These two steps are repeated until
the algorithm converges to a local optimum. The EM steps for the Gaussian pLSA model
are:



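E-Step:

$$Q(z; u, y, v, \theta) = \frac{P(z \mid u)\, P(v \mid y, z)}{\sum_{z'} P(z' \mid u)\, P(v \mid y, z')}$$

M-Step:

$$P(z \mid u) = \frac{\sum_{\langle u', y, v \rangle : u' = u} Q(z; u', y, v, \theta)}{\sum_{z'} \sum_{\langle u', y, v \rangle : u' = u} Q(z'; u', y, v, \theta)}$$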



The location and scale parameters also have to be updated.

Analogously, the same model can be applied by representing the latent class variable
z not as user communities but as item clusters.
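
The EM procedure described above can be illustrated with a short sketch. It is a minimal illustration under stated assumptions, not Hofmann's implementation: the function names `gaussian_plsa_em` and `predict_vote`, the random initialisation, the fixed number of iterations and the lower bound on the scale parameter are all choices made here for the example. The updates of the location and scale parameters are taken to be the Q-weighted mean and standard deviation of the votes per item and class, and the predicted vote is taken to be the expected rating.

```python
import numpy as np

def gaussian_plsa_em(users, items, votes, n_users, n_items, n_z, n_iter=50, seed=0):
    """EM sketch for Gaussian pLSA on observation triples (u, y, v)."""
    rng = np.random.default_rng(seed)
    users = np.asarray(users)
    items = np.asarray(items)
    votes = np.asarray(votes, dtype=float)
    # P(z|u): one row per user, each row sums to 1 (random start, an assumption)
    p_z_u = rng.dirichlet(np.ones(n_z), size=n_users)
    # location mu[y, z] and scale sigma[y, z] for every item/class combination
    mu = votes.mean() + 0.1 * rng.standard_normal((n_items, n_z))
    sigma = np.full((n_items, n_z), votes.std() + 1e-3)
    for _ in range(n_iter):
        # E-step: Q(z; u, y, v) is proportional to P(z|u) * P(v|y, z)
        m, s = mu[items], sigma[items]              # shape (n_obs, n_z)
        dens = np.exp(-(votes[:, None] - m) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
        q = p_z_u[users] * dens + 1e-300            # guard against all-zero rows
        q /= q.sum(axis=1, keepdims=True)
        # M-step for P(z|u): normalised sum of Q over the observations of user u
        counts = np.zeros((n_users, n_z))
        np.add.at(counts, users, q)
        seen = counts.sum(axis=1) > 0               # users without votes keep their row
        p_z_u[seen] = counts[seen] / counts[seen].sum(axis=1, keepdims=True)
        # M-step for location and scale: Q-weighted mean and standard deviation
        w = np.zeros((n_items, n_z))
        np.add.at(w, items, q)
        wv = np.zeros((n_items, n_z))
        np.add.at(wv, items, q * votes[:, None])
        safe = np.maximum(w, 1e-12)
        mu_new = wv / safe
        wvar = np.zeros((n_items, n_z))
        np.add.at(wvar, items, q * (votes[:, None] - mu_new[items]) ** 2)
        mu = np.where(w > 1e-12, mu_new, mu)
        sigma = np.where(w > 1e-12, np.sqrt(wvar / safe), sigma)
        sigma = np.maximum(sigma, 1e-2)             # floor on the scale (assumption)
    return p_z_u, mu, sigma

def predict_vote(p_z_u, mu, u, y):
    """Expected vote of user u on item y: sum over z of P(z|u) * mu[y, z]."""
    return float(p_z_u[u] @ mu[y])
```

In practice one would monitor the log-likelihood and stop when it no longer improves instead of running a fixed number of iterations.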



Approach by Kim and Li 

The approach by Kim & Li (2004) seeks to solve the user-bias and new-item problems
with the help of item attributes. They have incorporated attributes of
movies such as genre, actors, years, etc. into collaborative filtering. It is expected that
when attributes are considered, it is possible to recommend a new item based on just
the user's fondness for the attributes, even though no user has voted for the item.

Kim & Li have a model rather similar to the aspect model by Hofmann, yet
there are several differences. First, class z is associated only with the items, but not with
the users, in contrast to the pLSA model by Hofmann. Note that the latent class z in
this approach is regarded as an item cluster instead of a user community. Furthermore,
they have applied some heuristic techniques to compute the corresponding
model parameters, which can be done in two steps. First, using attributes, they clustered
the items into different cliques with a simple K-means clustering algorithm. After
clustering the items, they computed the probability of every item, i.e. the value indicating
how much the item belongs to every clique. Then, an item-clique matrix with
all the probabilities is derived. In the second step, the original item-user matrix is
extended with the item-clique matrix, so that the attribute-cliques are treated just like
normal users.

Class z is built with the help of the extended item-user matrix. Every class z
consists of a number of items of high similarity. The quality of class z is responsible
for the accuracy of the later prediction of the user vote. A K-Medoids clustering algorithm
using Pearson's correlation is used to compute the classes. After clustering
the items into classes z, a new item for each class z is created using the arithmetic mean.
This new item is then the representative vector of the class z.

With the help of these representative items and a group matrix, which stores the
membership of every item of the item-user matrix, it is possible to compute the expected
vote for a user. In calculating the prediction, it is assumed that class z follows
a Gaussian distribution. Let $v_y$ be the rating vector of item y, $r_z$ the representative
vector of item cluster z, $ED(\cdot)$ the Euclidean distance, $v_{u,y}$ user u's vote on item
y and $C_z$ the set of items which are in the same item cluster z. Then the membership
degree $p(z \mid y)$ and the mean rating $\mu_{u,z}$ of user u on class z can be calculated as
follows:

$$p(z \mid y) = \frac{1 / ED(v_y, r_z)}{\sum_{z'} 1 / ED(v_y, r_{z'})}, \qquad \mu_{u,z} = \frac{\sum_{y \in C_z} v_{u,y}\, p(z \mid y)}{\sum_{y \in C_z} p(z \mid y)}$$
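
One possible reading of the membership degree and the mean rating is sketched below. It assumes that the membership degree is the inverse Euclidean distance between an item's rating vector and a cluster's representative vector, normalised over all clusters, and that the mean rating is the membership-weighted average of the user's votes on the items of the cluster. The function names, the small constant `eps` and the restriction to items the user has actually rated are assumptions made for this illustration.

```python
import numpy as np

def membership_degrees(item_vecs, rep_vecs, eps=1e-9):
    """p(z|y) proportional to 1 / ED(v_y, r_z), normalised over the clusters z."""
    # pairwise Euclidean distances, shape (n_items, n_clusters)
    dist = np.linalg.norm(item_vecs[:, None, :] - rep_vecs[None, :, :], axis=2)
    inv = 1.0 / (dist + eps)  # eps (an assumption) avoids division by zero
    return inv / inv.sum(axis=1, keepdims=True)

def mean_rating(user_votes, p_z_y, cluster_of_item, z):
    """Membership-weighted mean vote of one user on the items of cluster z.

    user_votes holds np.nan for items the user has not rated; such items are skipped.
    """
    in_z = (cluster_of_item == z) & ~np.isnan(user_votes)
    if not in_z.any():
        return float("nan")
    weights = p_z_y[in_z, z]
    return float((user_votes[in_z] * weights).sum() / weights.sum())
```

Items that lie close to the representative vector of a cluster therefore dominate the user's mean rating for that cluster.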

4 Evaluation protocols 

New-item problem 

To evaluate the prediction accuracy, we use a protocol which deletes one vote randomly
from every user in the dataset, the so-called AllBut1 protocol (Breese et al.
(1998)). The new-item problem is evaluated by a protocol similar to the AllBut1 protocol.
Likewise, this protocol deletes existing votes and builds the model to be
evaluated from the reduced dataset. The new items are created by
deleting all votes for a randomly selected item. After this is done for the required
number of items, one vote is deleted from each user as in the AllBut1 protocol. This
protocol has the advantage that the results for the new items can be compared with
the results for past-rated items. Mean Absolute Error (MAE) is used as the metric in
our experiments.
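
The new-item protocol can be sketched as follows. This is an illustrative sketch only: the dense rating matrix with `np.nan` for missing votes, the function names and the decision to keep the hidden votes of the new items as test targets are assumptions of the example.

```python
import numpy as np

def new_item_split(ratings, n_new_items, seed=0):
    """Split a (n_users, n_items) rating matrix with np.nan for missing votes.

    Returns (train, test); test is a list of (user, item, vote, is_new) tuples.
    """
    rng = np.random.default_rng(seed)
    train = ratings.copy()
    test = []
    # Step 1: create new items by hiding all votes of randomly selected items
    new_items = rng.choice(ratings.shape[1], size=n_new_items, replace=False)
    for y in new_items:
        for u in np.flatnonzero(~np.isnan(train[:, y])):
            test.append((int(u), int(y), float(train[u, y]), True))
        train[:, y] = np.nan
    # Step 2: AllBut1, hide one further randomly chosen vote of every user
    for u in range(train.shape[0]):
        rated = np.flatnonzero(~np.isnan(train[u]))
        if rated.size > 1:  # keep at least one vote per user (assumption)
            y = int(rng.choice(rated))
            test.append((u, y, float(train[u, y]), False))
            train[u, y] = np.nan
    return train, test

def mae(test, predict, is_new=None):
    """Mean Absolute Error over the test votes, optionally only new or only old items."""
    errors = [abs(predict(u, y) - v)
              for u, y, v, new in test
              if is_new is None or new == is_new]
    return float(np.mean(errors))
```

Calling `mae(test, predict, is_new=True)` and `mae(test, predict, is_new=False)` gives the two error values that allow new items to be compared with past-rated items.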

User-bias problem 

The user-bias problem occurs when two items have the same rating, but one item belongs
to a group of items which have not been given a good vote by the user, whereas
the other item belongs to a group which, in contrast, was given a good vote by the
user. In this case the item which belongs to the well-rated group should be recommended.

To find a pair of items for a user, all the items which are rated by the user
are taken into consideration and grouped twice: once into item groups with equal
rating and once into item groups with equal attributes. The historical vote
vectors of these pairs of items are then compared, excluding the vote of
the observed user. In the next step, we select all pairs of items which are in the same
group of equally rated items and in different attribute groups. One pair, which is to
be predicted, is randomly chosen and deleted from the dataset. This is then done for
all users in the database.

For each of these 'user-biased' pairs, the vote predictions are computed
with the four collaborative filtering algorithms we use in our
experiments and compared. The MAE metric is used to evaluate the prediction accuracy.
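
The selection of 'user-biased' pairs for a single user can be sketched as below. This is a simplified illustration: the dictionary-based data layout and the function names are assumptions, and the comparison of the historical vote vectors is left out because the protocol description does not specify how it is carried out.

```python
import itertools
import random

def user_bias_pairs(user_votes, item_attrs):
    """Pairs of items with equal vote by this user but different attribute groups.

    user_votes: dict item -> vote of one user
    item_attrs: dict item -> frozenset of attributes of the item
    """
    pairs = []
    for a, b in itertools.combinations(sorted(user_votes), 2):
        same_rating = user_votes[a] == user_votes[b]
        different_attrs = item_attrs[a] != item_attrs[b]
        if same_rating and different_attrs:
            pairs.append((a, b))
    return pairs

def hold_out_pair(user_votes, item_attrs, rng):
    """Choose one pair at random and delete both of its votes from the training data."""
    pairs = user_bias_pairs(user_votes, item_attrs)
    if not pairs:
        return None, dict(user_votes)
    pair = rng.choice(pairs)
    train = {item: vote for item, vote in user_votes.items() if item not in pair}
    return pair, train
```

Repeating `hold_out_pair` for every user yields the test pairs on which the vote predictions of the algorithms are compared.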



5 Evaluation and experimental results 

Two datasets are used for our experiments: EachMovie, containing 2,558,871
votes from 61,132 users on 1,623 movies, and the MovieLens100k dataset, containing
100,000 ratings from 943 users on 1,682 movies. The datasets also contain genre
information for every movie in binary representation, which we used as attributes. The
EachMovie dataset contains 10 different genres, MovieLens contains 18. For both
datasets we draw 10 samples, on each of which 10 trials were run. For each sample 1500
movies are selected, together with 1000 users in EachMovie and 600 users in MovieLens.
We use 20 neighbours for both the user-based and the item-based CF on both datasets.
No normalization is used in the aspect model and the number of classes z is set
to 40 for both datasets. In the Kim & Li approach, we used 20 attribute-groups and
40 item clusters for both datasets. We have selected the above parameter settings
because they were reported as the parameters which gave the best results in
former experiments by the corresponding authors.

At first, we compared the four observed approaches, namely the user-based CF, item-based
CF, aspect model and Kim & Li approach, using the AllBut1 protocol. As shown in
Figures 2 and 3, the aspect model performs best, the approach by Kim & Li is
only slightly worse, while the user- and item-based CF algorithms perform worst.




[Figures 2 and 3: MAE of User-Based CF, Item-Based CF, Aspect Model (GpLSA) and K & L]

Fig. 2. AllBut1 using EachMovie. Fig. 3. AllBut1 using MovieLens.



New-item problem 

The results for the new-item problem are presented in Figures 4 and 5. Comparing the
performance achieved by the algorithms which use no attributes with that of the Kim & Li
approach, we can see that the performance of the Kim & Li approach is only negligibly
affected when more new items are added, while the prediction accuracy of the
other approaches becomes much worse. This phenomenon is in line with our expectations,
because it is not possible for algorithms which do not take the attributes into
account to find any relations between new items and already rated items. The
Kim & Li approach, in contrast, has no difficulty assigning an unrated item to an item
cluster, because it includes the attributes. The average standard deviation is about 0.03.

Fig. 4. New-Item using EachMovie. Fig. 5. New-Item using MovieLens.

User-bias problem

In the experiments on the user-bias problem, the number of items for prediction is
between 60 and 70% of the total number of items, which is a representative amount.
As shown in Figures 6 and 7, our expectations are confirmed. Only the approach
by Kim & Li can detect the difference between two items which have the same historical
rating but belong to different attributes, while the other approaches have no
way of finding the type of items the user likes because they do not take
attributes into consideration. It is interesting to see that the aspect model, which
performs best in general, performs worse than the user- and item-based CF when special
problems such as the user-bias and new-item problems are considered.

[Figures 6 and 7: results for User-Based CF, Item-Based CF, Aspect Model (GpLSA) and K & L]

Fig. 6. User-Bias using EachMovie. Fig. 7. User-Bias using MovieLens.



6 Conclusion 

The aim of this paper is to show that the new-item problem and user-bias problem
can be solved with the help of attributes. We have used three CF algorithms which do
not use any attributes and one approach which takes the attribute information into
account to compute the recommendations in our evaluation. Our evaluations have
shown that it is possible to solve the new-item problem and user-bias problem with
the help of attributes. In general, the approach by Kim & Li cannot surpass the aspect
model, but it can solve the specific problems of new-item and user-bias more effectively.
This holds especially for the new-item problem, since in reality it is not uncommon to have
30-50 new items being injected into the database. Hence, we can conclude that by
applying the right algorithms to the right cases, we can improve the recommendation
quality rather significantly.

Since a small number of attributes could already help to overcome the
problems of new-item and user-bias, it should be possible to improve the results
further with more adequate attributes. For future work, it would be interesting to find
out how to select better attributes and how the attributes affect the performance.

Abstract. With the increasing popularity of collaborative tagging systems, services that
assist the user in the task of tagging, such as tag recommenders, are increasingly required.
Since the scenario is similar to that of traditional recommender systems, where nearest neighbor
algorithms, better known as collaborative filtering, were extensively and successfully applied,
applying the same methods to the problem of tag recommendation seems a natural
way to follow. However, it is necessary to take into consideration some particularities of these
systems, such as the absence of ratings and the fact that two entity types in a rating scale
correspond to three top level entity types, i.e., users, resources and tags. In this paper we cast
the tag recommendation problem into a collaborative filtering perspective and, starting from a
view on the plain recommendation task without attributes, we make a ground evaluation comparing
different tag recommender algorithms on real data.



1 Introduction 

The process of building the Semantic Web (Berners-Lee et al. 2001) is currently
an area of high activity. Both the theory and technology to support it have already
been defined and now one must fill this structure with life. Despite its apparent
simplicity, this task actually represents the biggest challenge towards its realization,
i.e., adding semantic annotation to Web documents and resources in order to provide
knowledge access instead of unstructured material. Annotation represents an
extra effort which certainly will not be done voluntarily without good reasons. In
this sense, it is necessary to incentivize and educate the user in this practice, e.g.,
by showing the benefits that can be achieved through it and alleviating the extra burden
with the recommendation of relevant annotations. With the recent emergence and
increasing popularity of the so-called collaborative tagging systems this is finally
possible (Golber et al. (2005)).

Recommending tags can serve various purposes, such as increasing the chances
of getting a resource annotated (or tagged) and reminding a user what a resource
is about. Furthermore, lazy annotating users would not need to come up with a tag
themselves but just select the ones readily available in the recommendation list according
to what they think is more suitable for the given resource.

Tag recommender systems recommend relevant tags for an untagged user resource.
Relevant here can assume different perspectives, for example, a tag can be
judged relevant to a given resource according to the society point of view, through
the opinion of experts in the domain or even based on the personal profile of an individual
user. The question would be which concept of relevance the user would prefer
the most when using tag recommender services. This paper attempts to address this
question through the following contributions: (i) formulation of the tag recommendation
problem and the introduction of a collaborative filtering-based tag recommender
algorithm, (ii) presentation of a simple protocol for tag recommender evaluation
and (iii) a ground and quantitative evaluation on real-life data comparing different
tag recommender algorithms.



2 Related work 

The literature regarding the specific problem of collaborative tag recommendation
is still sparse. The majority of the recent research work about collaborative tagging
systems and folksonomies is concerned with devising approaches to better structure the
data for browsing and searching, where the recommendation problem is sometimes
only highlighted as a potential property to be further explored in future work (Mika
(2005), Hotho et al. (2006), Brooks and Montanez (2006), Heymann and Garcia-Molina
(2006)). We briefly describe below the works specifically investigating the
problem of collaborative tag recommendation.

Autotag (Mishne (2006)) is a tool that suggests tags for weblog posts using collaborative
filtering methods. Given a new weblog post, posts which are similar to it
are identified through traditional information retrieval similarity measures. Next, the
tags assigned to these posts are aggregated, creating a ranked list of likely tags. Despite
the collaborative filtering scenario, there is no real personalization because the
user is not taken directly into account. Furthermore, the evaluation is done in a
semi-automatic fashion where the assumption of tag relevance for a given resource is
defined to some extent by human experts.

Xu et al. (2006) introduce a collaborative tag suggestion algorithm based on a set 
of general criteria to identify high quality tags. Some of the considered criteria are: 
high coverage of multiple facets to ensure good recall, least effort to reduce the cost 
involved in browsing, and high popularity to ensure tag quality. A goodness measure 
for tags, derived from collective user authorities, is iteratively adjusted by a reward- 
penalty algorithm, which also incorporates other sources of tags, e.g., content-based 
auto-generated tags. There is no quantitative evaluation. 

Benz et al. (2006) introduce a collaborative approach for bookmark
classification based on a combination of nearest-neighbor classifiers. Two separate
kinds of recommendations are generated: keyword recommendations on the
one hand, i.e. which keywords to use for annotating a new bookmark, and a recommendation
of a classification on the other hand. The keyword recommender can be
regarded as a collaborative tag recommender, but it is just a component of the overall
algorithm, and therefore there is no information about its effectiveness as a stand-alone
tool.

The state-of-the-art tag recommenders in practice are services that provide the
most popular tags used by the society for a particular resource (Fig. 1). This is usually
done by means of tag clouds where the most frequently used tags are depicted
in a larger font or otherwise emphasized.

The approaches described above address important aspects of the problem, but
there is still a lack of quantitative evaluation of basic tag recommender algorithms.
Furthermore, there is no common or agreed protocol under which the different
algorithms should be compared.



3 Recommender Systems 



Recommender systems (RS) recommend products to customers based on ratings or 
past customer behavior. In general, RS predict ratings of items or suggest a list of 
unknown items to the user. They usually take the users, items and the ratings of 
items into account. A recommender system can be briefly formulated as: 

• A set of users $U$

• A set of items $I$

• A set $S \subseteq \mathbb{R}$ of possible ratings where $r: U \times I \to S$ is a partial function that
associates ratings to user/item pairs. In datasets r is typically represented as a list
of tuples $(u, i, r(u, i))$ with $u \in U$, $i \in I$ and r defined for the domain $\mathrm{dom}\, r \subseteq U \times I$

• Task: In recommender systems the recommendations are for a given user $u \in U$
a set $\hat{I}(u) \subseteq I$ of items. Usually $\hat{I}(u)$ is computed by first generating a ranking
on the set of items according to some quality or relevance criterion, from which
then the top n elements are selected (see Eq. 2 below).



In CF, for m users and n items, the user profiles are represented in a user-item
matrix $X \in \mathbb{R}^{m \times n}$. The matrix can be decomposed into row vectors:

$$X := [x_1, \ldots, x_m]^T \quad \text{with } x_u := [x_{u,1}, \ldots, x_{u,n}] \text{ for } u := 1, \ldots, m$$

where $x_{u,i}$ indicates that user u rated item i by $x_{u,i} \in \mathbb{R}$. Each row vector $x_u$ thus
corresponds to a user profile representing the item ratings of a particular user. This
decomposition leads to user-based CF.

The matrix can alternatively be represented by its column vectors:

$$X := [x_1, \ldots, x_n] \quad \text{with } x_i := [x_{1,i}, \ldots, x_{m,i}]^T \text{ for } i := 1, \ldots, n$$

where each column vector $x_i$ corresponds to a specific item's ratings by all m users.
This representation leads to item-based recommendation algorithms.

The pairwise similarities between users are usually computed by means of vector
similarity:

$$\mathrm{sim}(\mathrm{prof}_u, \mathrm{prof}_v) := \frac{\langle \mathrm{prof}_u, \mathrm{prof}_v \rangle}{\|\mathrm{prof}_u\|\, \|\mathrm{prof}_v\|} \qquad (1)$$

where $u, v \in U$ are two users and $\mathrm{prof}_u$ and $\mathrm{prof}_v$ are their profile vectors.

Let $B \subseteq I$ be the basket of items of the active user $u \in U$ and $N_u$ his/her best
neighbors. The topN recommendations usually consist of a list of items ranked by
decreasing frequency of occurrence in the ratings of the neighbors:

$$\hat{I}(u) := \operatorname*{argmax}^{n}_{i \in I}\ |\{v \in N_u \mid (v, i) \in \mathrm{dom}\, r\}| \qquad (2)$$

where $B \cap \hat{I}(u) := \emptyset$ and n is the size of the recommendation list.

The brief discussion above refers only to the user-based CF case, since it is the 
focus of our work. Moreover, we consider only the recommendation task since in 
collaborative tagging systems there are no ratings and therefore no prediction. For a 
detailed description about the item-based CF algorithm see Deshpande et al. (2004). 
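
The user-based topN recommendation described in this section can be sketched as follows. This is a minimal illustration that assumes a binary user-item matrix; the function names and the tie-breaking by smaller index are choices made for the example.

```python
import numpy as np

def cosine_sim(a, b):
    """Vector similarity of two profile vectors; 0 if one of them is empty."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def top_n_items(X, u, k, n):
    """Recommend up to n items for user u from a binary user-item matrix X."""
    # similarity of the active user to every other user
    sims = np.array([cosine_sim(X[u], X[v]) if v != u else -np.inf
                     for v in range(X.shape[0])])
    # k best neighbors, ties broken by smaller user index
    neighbors = np.argsort(-sims, kind="stable")[:k]
    # frequency of occurrence of each item among the neighbors
    freq = X[neighbors].sum(axis=0).astype(float)
    # items already in the basket of the active user are never recommended
    freq[X[u] > 0] = -np.inf
    ranked = np.argsort(-freq, kind="stable")
    return [int(i) for i in ranked[:n] if freq[i] > 0]
```

For example, with four users and four items, user 0 who owns items 0 and 1 receives the items most frequent among the two most similar users.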



4 Tag Recommender Systems 

Tag recommender systems recommend relevant tags for a given resource. As already
discussed in section 1, the notion of relevance here can assume different perspectives
and it is usually hard to judge what concept of relevance would be preferable to a
particular user. Collaborative tagging systems usually allow the users to see the most
popular tags used for a given resource. This can be thought of as a social-based tag
recommender service since it represents the opinion of the society as a whole. Through
CF we can measure the extent to which personalized notions of tag relevance are
preferable in comparison with the socialized ones.

Collaborative tagging systems are usually composed of users, resources and tags
and allow users to assign tags to resources. What is considered a resource depends on
the type of the system, e.g. URLs (del.icio.us¹), pictures (Flickr²), music (Last.fm³),
etc. A tag recommender system can be formulated as follows:

• A set of users $U$

• A set of resources $R$

• A set of tags $T$

• A function $s: U \times R \to 2^T$ associating tags to user/resource pairs, so that $s(u, r) \subseteq T$,
and s is defined for the domain $\mathrm{dom}\, s \subseteq U \times R$

• Task: In tag recommender systems the recommendations are for a given user
$u \in U$ and a resource $r \in R$ a set $\hat{T}(u, r) \subseteq T$ of tags. As in the traditional
formulation (section 3), $\hat{T}(u, r)$ can also be computed by first generating a
ranking on the set of tags according to some quality or relevance criterion, from
which then the top n elements are selected (see Algo. 1 below).

When comparing the formulation above with the one in section 3, we observe that
CF cannot be applied directly. This is due to the additional dimension represented by
T. Either we use more complex methods to deal directly with it or we reduce it to a
lower dimensional space where we can apply CF. We follow the latter approach.

¹ http://del.icio.us
² http://www.flickr.com
³ http://www.last.fm

To this end we take all the two-dimensional projections of the original matrix
preserving the user information. Letting $K := |U|$, $M := |R|$ and $L := |T|$, the projections
result in two user profile matrices: a user-resource $K \times M$ matrix X and a
user-tag $K \times L$ matrix Y. In collaborative tagging systems there is usually no rating
information. The only information available is whether or not a resource and/or a tag
occurred with the user. This can be encoded in the binary matrices $X \in \{0, 1\}^{K \times M}$ and
$Y \in \{0, 1\}^{K \times L}$ indicating occurrence, e.g. $x_{k,m} = 1$ and $y_{k,l} = 1$, or non-occurrence of
resources and tags with the users. Now we have the required setup to apply collaborative
filtering.

The algorithm starts by selecting the users who have tagged the resource in question.
Next, the pairwise similarity computation is performed (Eq. 1). Notice that now we
have two possible setups in which the neighborhood can be formed, either based on
the profile matrix X or on Y. The neighborhood's tags for the resource in question are
aggregated and weighted based on the neighbors' similarities with the active user.
Next the weights of each particular tag are summed up and the recommendation list
is ranked by decreasing value of the summed weights. Ties are broken by smaller
index. The overall CF procedure for tag recommendations is summarized in Algo. 1.



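Algorithm 1 CF for tag recommendations

• Given a new and/or untagged resource $r \in R$ for the active user $u \in U$

• Let $A := \{v \in U \mid s(v, r) \neq \emptyset\}$ denote the set of users who have tagged r, where s is a function
associating tags to user/resource pairs

  - Find k best neighbors:

    $$N_u := \operatorname*{argmax}^{k}_{v \in A}\ \mathrm{sim}(\mathrm{prof}_u, \mathrm{prof}_v)$$

  - Output the top n tags:

    $$\hat{T}(u, r) := \operatorname*{argmax}^{n}_{t \in T} \sum_{v \in N_u} \mathrm{sim}(\mathrm{prof}_u, \mathrm{prof}_v)\, \delta(v, r, t)$$

where $\delta(v, r, t) := 1$ if $(v, r, t) \in U \times R \times T$ is an observed tag assignment and 0 otherwise.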
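
Algorithm 1 can be sketched in a few lines. The sketch assumes that the data is given as a list of (user, resource, tag) triples and that the binary profiles are stored as sets, so that the vector similarity of Eq. 1 reduces to the size of the intersection divided by the geometric mean of the set sizes. The function names and the `profile` switch between the user-tag and the user-resource setup are assumptions of this illustration.

```python
import math
from collections import defaultdict

def cosine(p, q):
    """Cosine similarity of two binary profiles represented as sets."""
    if not p or not q:
        return 0.0
    return len(p & q) / math.sqrt(len(p) * len(q))

def recommend_tags(triples, u, r, k, n, profile="tag"):
    """Top n tags for user u and resource r from (user, resource, tag) triples."""
    prof = defaultdict(set)      # user -> tags (matrix Y) or resources (matrix X)
    tags_of = defaultdict(set)   # (user, resource) -> tags assigned by that user
    for v, res, t in triples:
        prof[v].add(t if profile == "tag" else res)
        tags_of[(v, res)].add(t)
    # A: all other users who have tagged r
    A = sorted(v for v in list(prof) if v != u and tags_of.get((v, r)))
    sims = {v: cosine(prof[u], prof[v]) for v in A}
    # k best neighbors, ties broken by smaller user index
    N_u = sorted(A, key=lambda v: (-sims[v], v))[:k]
    # sum the neighbor similarities for every tag they assigned to r
    weight = defaultdict(float)
    for v in N_u:
        for t in tags_of[(v, r)]:
            weight[t] += sims[v]
    # rank by decreasing summed weight, ties broken by smaller tag index
    return sorted(weight, key=lambda t: (-weight[t], t))[:n]
```

Switching `profile` between `"tag"` and `"resource"` corresponds to forming the neighborhood on the user-tag or on the user-resource profile matrix.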



5 Experimental setup and results 

For our experiments we used the data made available by the Audioscrobbler⁴ system,
a music engine based on a collection of music profiles. These profiles are built
through the use of the company's flagship product, Last.fm, a system that provides
personalized radio stations for its users, updates their profiles using the music
they listen to and also makes personalized artist recommendations. In addition,
Audioscrobbler exposes large portions of data through its web services API.

⁴ http://www.audioscrobbler.net

Fig. 1. Most popular tags for a given artist



Here we considered only the resources with 10 or more tag assignments. This
gave us 2,917 users, 1,853 artists (playing the role of resources), 2,045 tags and
219,702 instances ((user, resource, tag) triples).

We evaluated four tag recommenders: (i) most global frequent tags, which recommends
the most used tags in the sample dataset, (ii) most popular tag by resource,
which recommends the most used tags for a particular resource (in our case
an artist), (iii) a user-resource-based CF, which computes the neighborhood based
on the user-resource matrix and (iv) a user-tag-based CF, which computes the
neighborhood based on the user-tag matrix. Notice that (ii) represents the state-of-the-art
recommender used in practice (Fig. 1).

To evaluate the recommenders we used a variant of the leave-one-out holdout
estimation that we named leave-tags-out. The idea is to choose a resource at random
for each user in the test set and hide the tags attached to it. The algorithm must try to
predict the hidden tags. To count the hits made by the algorithms we used the usual
recall measure

$$\mathrm{recall}(D) := \frac{1}{|D|} \sum_{(u, r) \in D} \frac{|T(u, r) \cap \hat{T}(u, r)|}{|T(u, r)|} \qquad (3)$$

where D is the test set, T the true tags and $\hat{T}$ the predicted ones. Since the precision
is forced by taking into account only a restricted number n of recommendations
there is no need to evaluate precision or F1 measures, i.e., for this kind of scenario
precision is just the same as recall up to a multiplicative constant. Each algorithm was
evaluated 10 times for n=10 (size of recommendation list) and the results averaged
(Fig. 2).

[Figure 2: recall of the tag recommender algorithms MostFrequent Global, MostFrequent byResource, User Based (k=100) and UserTags]

Fig. 2. Recall of tag recommenders for n=10

Looking at Figure 2 we see that the most popular by resource recommender
reached a surprisingly high recall and that the user-resource-based CF did not perform
significantly better than that. The good results of the most popular by resource
algorithm can in part be explained by the fact that this service is already provided by
the system. Besides that, it shows the strong influence of the society's vocabulary on
the user's personal opinion. On the other hand, the user-tag-based CF recommender
performed at least 2% better⁵ than both the most-popular tag by resource and
user-resource-based CF. Also notice that the improvement is consistent for different
values of n (Fig. 3). The best k-neighbors values were estimated through successive
runs in which k was incremented until no more improvements in the
results were observed.
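
The leave-tags-out protocol and the recall measure can be sketched as follows. This is an illustration under assumptions: the data is a list of (user, resource, tag) triples, recall is averaged over the hidden (user, resource) pairs, and `recommend` stands for any function that returns a ranked list of tags for a user and a resource. All names are hypothetical.

```python
import random

def leave_tags_out(triples, test_users, rng):
    """Hide, for each test user, all tags of one randomly chosen resource."""
    by_user = {}
    for u, r, t in triples:
        by_user.setdefault(u, {}).setdefault(r, set()).add(t)
    hidden = {}
    for u in test_users:
        r = rng.choice(sorted(by_user[u]))
        hidden[(u, r)] = by_user[u][r]
    # training data: everything except the hidden (user, resource) posts
    train = [(u, r, t) for u, r, t in triples if (u, r) not in hidden]
    return train, hidden

def recall(hidden, recommend, n=10):
    """Average share of hidden tags found among the first n recommended tags."""
    total = 0.0
    for (u, r), true_tags in hidden.items():
        predicted = set(recommend(u, r)[:n])
        total += len(true_tags & predicted) / len(true_tags)
    return total / len(hidden)
```

Any of the four recommenders can be plugged in as `recommend` after it has been built from `train`, which makes their recall values directly comparable.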



6 Conclusions 

In this paper we applied CF to the tag recommendation problem and made a quan- 
titative evaluation of its performance in comparison with other simpler tag recom- 
menders. Furthermore, we used a simple and suitable protocol with which further 
approaches can be compared. 

Despite the already good results of the baseline algorithms, the straightforward 
CF based on the user-tag profile matrix showed a significant improvement. This 
shows that users with similar tag vocabulary tend to tag alike, which indicates a 
preference for personalized tag recommendation services. 

Also notable are the reasonably good results achieved by the most global frequent
tags recommender, which indicate its adequacy for cold-start related problems,
where just a few tags are available in the system.

In future work we plan to reproduce the same experiments with different datasets
from different domains to confirm the results presented here. We also want to refine
the CF algorithms by exploring different combinations of the user similarities
obtained from the two profile matrices, i.e., user-resources and user-tags. Moreover,
we will compare the CF approach with more complex models such as multi-label
and relational classifiers.

⁵ T-test for a significance level of 0.05.

Abstract. This contribution reports on the development of small sample test statistics for
identifying recommendations in market baskets. The main application is to lessen the cold
start problem of behavior-based recommender systems by generating quality recommendations
faster from the first small samples of user behavior. The derived methods are applied in
the area of library networks but are generally applicable in any consumer store setting. Analysis
of market basket size at different organisational levels of German research library networks
reveals that at the highest network level market basket size is considerably smaller than at the
university level. The overall data volume is considerably higher. These facts motivate the
development of small sample tests for the identification of non-random sample patterns. As in
repeat-purchase theory the independent stochastic processes are modelled. The small sample
tests are based on modelling the choice-acts of a decision maker completely without preferences
by a multinomial model and combinatorial enumeration over a series of increasing event
spaces. A closed form of the counting process is derived.



1 Introduction 

Recommender systems have lately become standard features of online stores. As
shown by the revealed preference theory of Paul A. Samuelson (1938a, 1938b, 1948),
customer purchase data reveals the preference structure of decision makers. It is
the best indicator of interest in a specific product and significantly outperforms
surveys with respect to reliability. A behavior-based recommender system reads
observed user behavior (e. g. purchases) as input, then aggregates and directs the
resulting recommendations to appropriate recipients. One of the main mechanism
design problems of behavior-based recommender systems is the cold start problem:
a certain amount of usage data has to be observed before the first recommendations
can be computed. Starting with recommendations drawn from similar applications
is in general a bad idea, since it cannot be guaranteed that the usage patterns
of customers in these applications are identical. Behavior-based recommendations
are best suited to the user group whose usage data is used to generate the very same
recommendations. Thus, to lessen the cold start problem, small sample test statistics
are needed to generate quality recommendations faster out of the first small samples







Andreas W. Neumann and Andreas Geyer-Schulz 



of user behavior. The main problem is to determine which co-purchases occur
randomly and which show a relationship between two products. In this contribution we
apply the derived methods to usage data from scientific libraries. The methods and
algorithms are generally applicable in any consumer store setting. For an overview
of recommender systems see, e. g., Adomavicius and Tuzhilin (2005).



2 The ideal decision maker: The decision maker without 
preferences 

Modelling the preference structure of decision makers in a classical way leads to
causal models which explain the choice of the decision maker, allow the prediction of
future behavior, and allow inferring actions of the seller to influence or change the
choice of the decision maker (e. g. see Kotler (1980)). In the library setting, causal
modelling of the preference structure of decision makers would require the identification
(and estimation) of such a model, which explains the choice of a decision maker or of a
homogeneous group of decision makers (a customer segment), for each of the more than
10,000,000 books (objects) in a library meta catalog. Solving the model identification
problem optimally requires selecting the subset of relevant variables out of
$2^{10000000}$ subsets in the worst case. While a lot of research has investigated automatic
model selection, e. g. by Theil's $\bar{R}^2$ or Akaike's information criterion (AIC) (for
further references see Maddala (2001) pp. 479-488), the problem is still unsolved.

The idea to ignore interdependencies between system elements for large systems 
has been successfully applied in the derivation of several laws in physics. The first 
example is the derivation of Boltzmann’s famous H-theorem where the quantity H 
which he defined in terms of the molecular velocity distribution function behaves 
exactly like the thermodynamic entropy (see Prigogine (1962)). In the following, we 
ignore the interdependencies between model variables completely. For this purpose, 
we construct an ideal decision maker without preferences. Such an ideal decision 
maker can be regarded as a prototype of a group of homogeneous decision makers 
without preferences against which groups of decision makers with preferences can be 
tested. For a group of ideal decision makers this is obvious; for a group of decision
makers with preferences the principle of self-selection (Spence (1974), Rothschild
and Stiglitz (1976)) grants homogeneity. The ideal decision maker draws k objects
(each object represents a co-purchase, i.e. a pair of books) out of an urn with n
objects, with replacement, at random and, for simplicity, with equal probability. The
number of possible co-purchases, and thus the event space, is unknown.

In marketing, several conceptual models which describe a sequence of sets (e. g.
total set ⊇ awareness set ⊇ consideration set ⊇ choice set, Kotler (1980) p. 153) have
been developed to describe this situation (Narayana and Markin (1975), Spiggle and
Sewall (1987)). Narayana and Markin have investigated the size of the awareness
set for several branded products empirically. E. g., they report a range of 3-11
products with an average of 6.5 in the awareness set for toothpaste and similar results
for other product categories. This allows the conjecture that the event space size is
larger than k and in the worst case bounded by k-times the maximal size of the
awareness set.

A survey of the statistical problems (e. g. violation of the independence of irrelevant
alternatives assumption, biases in estimating choice models etc.) related to this
situation can be found in Andrews and Srinivasan (1995) or Andrews and Manrai
(1998). Recent advances in neuroimaging have even allowed experimental proof of the
influence of branding on brain activity in a choice situation, which leads to models
that postulate interactions between reasoning and emotional chains (e. g. Deppe
et al. (2005), Bechara et al. (1997)).

As a result of the sampling process of an ideal decision maker we observe a histogram
with at most k objects, with the drawing frequencies summing to k. For each event
space size from k to n, the distribution of the drawing frequencies is a partition of k;
the set of all possible distributions is given by enumerating all possible partitions of k
for this event space. The probability of observing a specific partition in a specific
event space is the sum of the probabilities of all sample paths of length k leading to
this partition. The probability distribution of partitions drawn by an ideal decision
maker in a specific event space n > k serves as the basis of the small sample test
statistic in section 6. For the theory of partitions see Andrews (1976).
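
The sample-path view described above can be illustrated with a small brute-force sketch. It is not taken from the paper: the function name `partition_distribution` is hypothetical, and the enumeration is only feasible for very small k and n because it visits all n^k equally likely paths.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def partition_distribution(k, n):
    """Exact distribution of partitions of k drawn by an ideal decision maker.

    Enumerates all n**k equally likely sample paths (k draws with replacement
    from an urn of n objects) and groups them by the resulting partition.
    """
    counts = Counter()
    for path in product(range(n), repeat=k):
        # The drawing frequencies per object, sorted descending, form a partition of k.
        partition = tuple(sorted(Counter(path).values(), reverse=True))
        counts[partition] += 1
    total = n ** k
    return {p: Fraction(c, total) for p, c in counts.items()}


dist = partition_distribution(k=3, n=4)
for partition, prob in sorted(dist.items()):
    print("+".join(map(str, partition)), prob)
```

For k = 3 and n = 4 the three partitions 1+1+1, 2+1 and 3 receive the probabilities 3/8, 9/16 and 1/16, which sum to one as expected.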



3 Library meta catalogs: An exemplary application area 

For evaluation purposes we apply our techniques in the area of meta catalogs of
scientific libraries. Due to transaction costs, the detailed inspection of documents in
the online public access catalog (OPAC) of a library can be put on a par with a
purchase incidence in a consumer store setting. A market basket consists of all
documents that have been co-inspected by one user within one session. To answer the
question which co-inspections occur non-randomly, for larger samples we apply an
algorithm based on calculating inspection frequency distribution functions following
a logarithmic series distribution (LSD) (Geyer-Schulz et al. (2003a)). Such a
recommender system has been operational at the OPAC of the university library of
Karlsruhe (UBKA) since June 2002 (Geyer-Schulz et al. (2003b)) and within Karlsruhe’s
Virtual Catalog (KVK), a meta catalog searching 52 international catalogs, since March
2006. These systems are fully operational services accessible by the general public;
for further information on how to use them see Participate! at
http://reckvk.em.uni-karlsruhe.de/.



Table 1. Statistical Properties of the Data (Status of 2007-02-19)

                                              UBKA         KVK
Number of total documents in catalog          1,000,000    > 10,000,000
Number of total co-inspected documents        527,363      255,248
Average market basket size                    4.9          2.9
Av. aggregated co-inspections per document    117.4        5.4

Table 1 shows some characteristics of the UBKA and KVK usage data. Because
of the smaller market basket size, the shorter observation period, and the much higher
(unknown) number of total documents in the meta catalog KVK, the average number of
aggregated co-inspections per document in the KVK is very small. Due to sample size
constraints, methods using statistical tests on distributions (like LSD) are only
reliably applicable with many co-inspections. Special small sample statistics are needed
to compute recommendations out of samples of few co-inspections. Our methods are
based on the assumption that all documents in the catalog have the same probability
of being co-inspected. In real systems this assumption generally does not hold, but
especially when starting to observe new catalogs no information about the underlying
distribution of the inspection processes of documents is known. Finally,
recommendations are co-inspections that occur significantly more often than predicted
if the assumption were true.



4 Mathematical notation 

For the mathematical formulation we use the following notation. The number of
total documents n+1 in the catalog is finite but unknown (this leaves n documents as
possible co-inspections for each document D in the catalog). Recommendations are
computed separately for each document D. Each user session (market basket) contains
all documents that the user inspected within that session; multiple inspections of
the same document are counted as one. All user sessions are aggregated. The
aggregated set C(D) contains all documents that at least one user has inspected together
with D. The number of co-inspections with D of all elements of C(D) is known; this
histogram is called H(D) and is the outcome of the multinomial experiment. When
removing all documents with no inspections from H(D) and then re-writing the number
of co-inspections as a sum, it can be interpreted as an integer partition of k with
the number of co-inspections of each co-inspected document as the summands, where k is
the number of non-aggregated co-inspections (multiple inspections in different
sessions are counted separately). E. g. 4+1+1 is an integer partition of k = 6 and shows
that the corresponding document D has been co-inspected in at least 4 (the highest
number) different sessions with 3 (the number of summands) other documents, with
the first document 4-times and with the second and third one time each.
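
The notation above can be made concrete with a short sketch that aggregates sessions into the histogram H(D) and writes it as an integer partition. The session data and the function names `co_inspection_histogram` and `to_partition` are invented for illustration and are not part of the described system.

```python
from collections import Counter


def co_inspection_histogram(sessions, d):
    """Return H(D): co-inspection counts with document d over all sessions."""
    histogram = Counter()
    for session in sessions:
        basket = set(session)  # multiple inspections within one session count once
        if d in basket:
            histogram.update(basket - {d})
    return histogram


def to_partition(histogram):
    """Write the co-inspection counts as an integer partition (descending)."""
    return tuple(sorted(histogram.values(), reverse=True))


# Hypothetical sessions; "D" is the document for which H(D) is built.
sessions = [
    ["D", "a", "b"],
    ["D", "a", "a", "c"],
    ["D", "a"],
    ["a", "D"],
    ["b", "c"],  # does not contain D and is therefore ignored
]
H = co_inspection_histogram(sessions, "D")
partition = to_partition(H)
k = sum(partition)
print(partition, k)
```

With these sessions the histogram corresponds to the partition 4+1+1 of k = 6 used as the example in the text.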



5 POSICI: Probability Of Single Item Co-Inspections 

The first method we introduce is based on the following question: What is the
probability $p_j(n)$ that at least one other document has been co-inspected exactly
j-times with document D? To answer the question we use the setup of the multinomial
distribution directly. Let $(N_1, \dots, N_n)$ be the vector of the number of times
document i ($1 \le i \le n$) was co-inspected with D. Then
$(N_1, \dots, N_n) \sim M(k; q_1, \dots, q_n)$ with $q_i = 1/n$, $1 \le i \le n$.
Now define $A_i = \{N_i = j\}$. By applying the inclusion-exclusion principle we can
now compute:

$$p_j(n) = P\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{v=1}^{n} (-1)^{v+1} \sum_{1 \le i_1 < \dots < i_v \le n} P(A_{i_1} \cap \dots \cap A_{i_v}) \qquad (1)$$

[Figure: Co-inspection probabilities for k = 8 (n = 8 to 50)]
Fig. 1. Inspection probabilities $p_j(n)$ for k = 8 and growing n in POSICI.

Since many of the summands on the right hand side are known to be equal to zero,
this equation can be implemented quite efficiently. Figure 1 shows $p_j(n)$ for k = 8
and growing n. In general $\lim_{n \to \infty} p_1(n) = 1$ and
$\lim_{n \to \infty} p_j(n) = 0$ for j = 2, 3, ... holds. Furthermore, $p_j(n)$ is
decreasing in j for all n. Based on these probabilities we define the POSICI
Recommendation Generating Algorithm:

1. Let D be the document for which recommendations are calculated.

2. Let n = k and let t be a fixed chosen acceptance threshold (0 < t < 1).

3. Determine $j_0 = \min_{j = 2, \dots, k} \{ j \mid p_j(n) < t\,p_1(n) \}$.

4. Recommend all documents that have been co-inspected with D at least $j_0$-times.

Thus, e. g. in the setting of figure 1 and t = 0.2 all documents that have been
co-inspected at least 4-times are recommended. POSICI is built on the theory
that co-inspections other than j-times add more noise than information about the
incentive to co-inspect the current document j-times.
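
The four POSICI steps can be sketched as follows. This is a minimal illustration under the uniform multinomial assumption, not the authors' implementation; the function names are hypothetical. The inclusion-exclusion sum uses the fact that all intersections of v events have the same probability, so only terms with v·j ≤ k are non-zero.

```python
from fractions import Fraction
from math import comb, factorial


def posici_prob(j, k, n):
    """p_j(n): probability that at least one of n documents is co-inspected
    exactly j times in k uniform multinomial draws (inclusion-exclusion)."""
    total = Fraction(0)
    for v in range(1, min(n, k // j) + 1):
        rest = k - v * j  # draws that fall on the other n - v documents
        # Sample paths in which v fixed documents occur exactly j times each.
        paths = factorial(k) // (factorial(j) ** v * factorial(rest)) * (n - v) ** rest
        total += (-1) ** (v + 1) * comb(n, v) * Fraction(paths, n ** k)
    return total


def posici_threshold(k, n, t):
    """j_0 = min{ j in 2..k : p_j(n) < t * p_1(n) }, or None if no such j exists."""
    p1 = posici_prob(1, k, n)
    for j in range(2, k + 1):
        if posici_prob(j, k, n) < t * p1:
            return j
    return None


def posici_recommend(histogram, t):
    """Recommend all documents of H(D) co-inspected at least j_0 times."""
    k = sum(histogram.values())
    j0 = posici_threshold(k, n=k, t=t)  # step 2 of the algorithm sets n = k
    if j0 is None:
        return []
    return sorted(doc for doc, count in histogram.items() if count >= j0)


print(posici_threshold(8, 8, Fraction(1, 5)))
```

For k = 8 and t = 0.2 the sketch yields the threshold of 4 co-inspections stated in the text.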



6 POMICI: Probability Of Multiple Items Co-Inspections 

[Figure: Partition probabilities for k = 6 (n = 6 to 50)]
Fig. 2. Inspection probabilities $p_{part}(n)$ for k = 6 and growing n in POMICI.

The second method is derived from the question: What is the probability $p_{part}(n)$
that the partition corresponding to the complete histogram H(D) of all co-inspections
with D occurs? To answer this question we re-formulate the problem in an algebraic
setting. Let X be the set of words of length k from an alphabet of n letters, and $l_i$
the number of letters (i. e. documents) that occur exactly i-times in $x \in X$ (i. e. in
H(D)). First we examine the actions of the group $G = S_n \times S_k$ on the set X, and
then the actions of the stabilizer subgroup $G_x$ on the set $S_n$ for the identity
$\mathrm{id} \in S_n$. By applying the orbit-stabilizer theorem twice together with
Lagrange’s theorem from group theory
($|G| = |Gx|\,|G_x| = |Gx|\,|G_x\,\mathrm{id}|\,|(G_x)_{\mathrm{id}}|$) and then some
counting arguments we come to the solution:

$$p_{part}(n) = \frac{|Gx|}{|X|} = \frac{|G|}{|X|\,|G_x\,\mathrm{id}|\,|(G_x)_{\mathrm{id}}|} = \frac{n!\,k!}{n^k \prod_{i=0}^{k} l_i! \, \prod_{i=1}^{k} (i!)^{l_i}} \qquad (2)$$

where $l_0 = n - \sum_{i=1}^{k} l_i$ denotes the number of letters that do not occur in x.
In general $\lim_{n \to \infty} p_{1+\dots+1}(n) = 1$ and
$\lim_{n \to \infty} p_{part}(n) = 0$ for all other partitions holds. As can be
seen by way of example in figure 2, the order of the partitions by probability is
stable only above a certain n. We use the smallest of these n to construct the POMICI
Recommendation Generating Algorithm:

1. Let D be the document for which recommendations are calculated.

2. Let t be a fixed chosen acceptance threshold (0 < t < 1).

3. Let $n_0$ be the smallest integer such that the order of the partitions by
probability is stable for all $n \ge n_0$.

4. Let s be the largest integer that occurs in the partition with the highest probability
below $t\,p_{1+\dots+1}(n_0)$.

5. For all partitions part with $p_{part}(n_0) < t\,p_{1+\dots+1}(n_0)$ do
   a) Recommend all documents from H(D) that have been co-inspected at least
   s-times.

Thus, e. g. in the setting of figure 2 and t = 0.05 all documents that have been
observed within the partitions 3+2+1, 3+3, 4+1+1, 4+2, 5+1 or 6 and have
been co-inspected at least 3 times are recommended ($n_0 = 21$,
$p_{1+\dots+1}(21) = 0.4555$). Note that this choice of $n_0$ indicates a risk-averse
decision maker. POMICI is built on the theory that the distribution of co-inspections
other than j-times reveals more information than noise about the incentive to
co-inspect the current document j-times.
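
The POMICI steps can be sketched in the same spirit. The sketch assumes the uniform case, computes the partition probability by counting words (n!/∏ l_i! assignments of letters times k!/∏ (i!)^{l_i} arrangements, divided by n^k), and takes n_0 as an input rather than searching for it. All function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial


def partitions(k, largest=None):
    """Yield all integer partitions of k as descending tuples."""
    if largest is None:
        largest = k
    if k == 0:
        yield ()
        return
    for part in range(min(k, largest), 0, -1):
        for rest in partitions(k - part, part):
            yield (part,) + rest


def pomici_prob(partition, n):
    """Probability that k uniform draws from n documents yield this partition."""
    k = sum(partition)
    if len(partition) > n:
        return Fraction(0)
    denom = factorial(n - len(partition))  # l_0!: documents that were never drawn
    for size, l in Counter(partition).items():
        denom *= factorial(l) * factorial(size) ** l
    return Fraction(factorial(n) * factorial(k), denom * n ** k)


def pomici_s(k, n0, t):
    """Largest part of the most probable partition below t * p_{1+...+1}(n0)."""
    bound = t * pomici_prob((1,) * k, n0)
    below = [p for p in partitions(k) if pomici_prob(p, n0) < bound]
    if not below:
        return None
    best = max(below, key=lambda p: pomici_prob(p, n0))
    return best[0]  # partitions are descending, so the first part is the largest


def pomici_recommend(histogram, n0, t):
    """Recommend documents of H(D) if its partition is improbable enough."""
    observed = tuple(sorted(histogram.values(), reverse=True))
    k = sum(observed)
    s = pomici_s(k, n0, t)
    bound = t * pomici_prob((1,) * k, n0)
    if s is None or pomici_prob(observed, n0) >= bound:
        return []
    return sorted(doc for doc, count in histogram.items() if count >= s)


print(float(pomici_prob((1,) * 6, 21)), pomici_s(6, 21, Fraction(1, 20)))
```

For k = 6, n_0 = 21 and t = 0.05 the sketch gives p_{1+...+1}(21) ≈ 0.4555 and s = 3, in line with the example in the text.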



7 POSICI vs. POMICI 

Since both methods are based on a homogeneous group of decision makers modeled
by the underlying uniform multinomial distribution, a direct connection between
them exists. The sum of the probabilities of all partitions from POMICI with at
least one product that was co-inspected exactly j-times is equal to the probability
in POSICI that there exists at least one product that was co-inspected exactly
j-times. In other words, we get from POMICI to POSICI by aggregating all partitions
that only differ in the noise area defined in the preference theory underlying
POSICI. Thus, equation 2 can also be used instead of the inclusion-exclusion principle
to calculate the probability in equation 1.

By setting the threshold t for the POSICI and POMICI algorithms respectively, 
the number of generated recommendations can be adjusted for both methods. As can 
be seen in figure 3, when the total number of recommendations is equal, POMICI 
generally generates longer recommendation lists for fewer documents than POSICI. 



8 Conclusions and further research 

POSICI and POMICI are based on different assumptions in the underlying preference
theory. To determine which method leads to qualitatively better recommendations
in a specific setting, the following question has to be answered: When does the
partition tail of smaller integers resemble noise and when incentive behavior? One
way to answer the question lies in the human evaluation of larger data sets. This is
planned for the library application.

Two ways to enhance the algorithms appear to be promising. First, if the overall
inspection probability of documents is known (through large behavior data sets), both
methods can be extended to be based on an underlying non-uniform multinomial
distribution. This cannot be applied in the case of a cold start but can be useful in
the scenario of very small market baskets covering a large part of the total documents.
Second, portraying the addition of further co-purchases (k → k+1) as a Markov
process enables us to calculate the probability that a product with currently few
co-inspections develops into one with many co-inspections and thus into a reliable
recommendation.

[Figure: POMICI vs. POSICI]
Fig. 3. Number of generated recommendations for all documents with k < 15 on the KVK
data for various t.

Abstract. Number and date expressions are essential information items in corpora and
therefore play a major role in various text mining applications. However, so far number
expressions have been investigated in a rather superficial manner. In this paper we
introduce a comprehensive number classification and present promising initial results of
a classification experiment using various Machine Learning algorithms (amongst others
AdaBoost and Maximum Entropy) to extract and classify number expressions in a German
newspaper corpus.



1 Introduction 

In many natural language processing (NLP) applications such as Information Ex- 
traction and Question Answering number expressions play a major role, e.g. ques- 
tions about the altitude of a mountain, the final score of a football match, or the 
opening hours of a museum make up a significant share of the users’ information
needs. However, common Named Entity task definitions do not consider number
and date/time expressions in detail (or, as in the Conference on Computational
Natural Language Learning (CoNLL) 2003 (Tjong Kim Sang (2003)), do not incorporate
them at all). We therefore present a novel, extended classification scheme for
number expressions, which covers all Message Understanding Conference (MUC) 
(Chinchor (1998a)) types but additionally includes various structures not considered 
in common Named Entity definitions. In our approach, numbers are classified ac- 
cording to two aspects: their function in the sentence and their internal structure. We 
argue that our classification covers most of the number expressions occurring in text 
corpora. Based on this classification scheme we have annotated the German CoNLL 
2003 data and trained various machine learning algorithms to automatically extract 
and classify number expressions. We also plan to incorporate the number extraction 
and classification system described in this paper into an open domain Web-based 
Question Answering system for German. As mentioned above, the recognition of 
certain date, time, and number expressions is especially important in the context of 
Information Extraction and Question Answering. E. g. the MUC Named Entity
definitions (Chinchor (1998b)) include the following basic types: date, time (<TIMEX>)
as well as monetary amount and percentage (<NUMEX>), and thus fostered the
development of extraction systems able to handle number and date/time expressions.
Well-known Information Extraction systems developed in conjunction with MUC are
e.g. FASTUS (Appelt et al. (1993)) or LaSIE (Humphreys et al. (1998)). At that
time, many researchers used finite-state approaches to extract Named Entities. More
recent Named Entity definitions, such as CoNLL 2003 (Tjong Kim Sang (2003)),
aiming at the development of Machine Learning based systems, however, again
excluded number and date expressions. Nevertheless, due to the increasing interest in
Question Answering and the TREC QA tracks (Voorhees et al. (2000)), a number of
research groups have recently investigated various techniques to quickly and accurately
extract information items of different types from text corpora and the Web, respectively.
Many answer typologies naturally include number and date expressions, e.g. the ISI
Question Answer Typology (Hovy et al. (2002)). Unfortunately, the corresponding
papers only specify the performance of the whole Question Answering system; we
therefore could not find any performance values that would be directly comparable
to our results. A very interesting and partially comparable work (Ahn et al. (2005)),
which considers only a small fraction of our classification, investigates the
extraction and interpretation of time expressions. Their reported accuracy values range
between about 40% and 75%.

Paper Plan: This paper is structured as follows. Section 2 presents our classifica- 
tion scheme and the annotation. Section 3 deals with the features and the experimen- 
tal setting. Section 4 analyzes the results and comments on the future perspectives. 



2 Classification of number expressions 

Many researchers use regular expressions to find numbers in corpora; however, most
numbers are part of a larger construct such as '2,000 miles' or 'Paragraph 249
Bürgerliches Gesetzbuch'. Consequently, the number without its context has no meaning
or is highly ambiguous (2,000 miles vs. 2,000 cars). In applications such as Question
Answering it is therefore necessary to detect this additional information. Table 1
shows example questions that obviously ask for number expressions as answers. The
examples clearly indicate that we are not looking for mere digits but for multi-word
units or even phrases consisting of a number and its specifying context. Thus, a number
is not a stand-alone piece of information and, as the examples show, might not even
look like a number at all. This paper therefore proposes a novel, extended
classification that handles number expressions similarly to Named Entities and thus
provides a flexible and scalable method to incorporate these various entity types into
one generic framework. We classify numbers according to their internal structure (which
corresponds to their text extension) and their function (which corresponds to their class).
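
The ambiguity of context-free number matching can be seen in a tiny sketch. The pattern `NUMBER` is an illustrative assumption, not a pattern used in the experiments: it returns the same bare match for a length and for a count, so the class cannot be derived from the match alone.

```python
import re

# Hypothetical pattern for digit groups with '.' or ',' separators.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

for text in ["a distance of 2,000 miles", "a fleet of 2,000 cars"]:
    # Both sentences yield the same match although the expressions
    # belong to different classes (measure vs. count).
    print(NUMBER.findall(text))
```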

We also included all MUC types to guarantee that our classification conforms 
with previous work. 




Classifying Number Expressions in German Corpora 555 



Table 1. Example Questions and Corresponding Types

Q: How far is the Earth from Mars?                           miles? light-years?
Q: How high is building X?                                   meters? floors?
Q: What are the opening hours of museum X?                   daily from 9 am to 5 pm
Q: How did Dortmund score against Cottbus last weekend?      2:3



2.1 Classification scheme 

Based on Web data and a small fraction of online available German newspaper
corpora (Frankfurter Rundschau¹ and die tageszeitung²) we deduced 5 basic types: date
(including date and time expressions), number (covering count and measure
expressions), itemization (rank and score), formula, and isPartofNE (such as street
number or zipcode). As further analyses of the corpora showed, most of the basic
types naturally split into sub-types, which also conforms to the requirements imposed
on the classification by our applications. The final classification thus comprises the
30 classes shown in table 2. The table additionally gives various examples and a short
explanation of each class’s sense and extension.

2.2 Corpora and annotation 

According to our findings in Web data and newspaper corpora we developed
guidelines which we used to annotate the German CoNLL 2003 data. To ensure a
consistent and accurate annotation of the corpus, we worked over every part in several
passes and performed a special reviewing process for critical cases. Table 3 shows
an exemplary extract of the data. It is structured as follows: the first column
represents the token, the second column its corresponding lemma, the third column its
part-of-speech, and the fourth column the information produced by a chunker.
We did not change any of these columns. In column five, which typically represents the
Named Entity tag, we added our own annotation. We replaced the given tag if we
found the tag O (= other) and appended our classification in all other cases.³ While
annotating the corpora we met a number of challenges:

• Preprocessing: The CoNLL 2003 corpus exhibits a couple of erroneous sentence
and token boundaries. In fact, this is much more problematic for the extraction of
number expressions than for Named Entity Recognition, which is not surprising,
since such errors inherently occur more frequently in the context of numbers.

• Very complex expressions: We found many date.relative and date.regular
expressions, which are extremely complex types in terms of length, internal
structure, and possible phrasing and are therefore difficult to extract and classify. In
addition, we also observed very complex number.amount contexts and a couple
of broken sports score tables, which we found very difficult to annotate.

¹ http://www.fr-online.de/
² http://www.taz.de/
³ Our annotation is freely available for download. However, we cannot provide the original
CoNLL 2003 data, which you need to reconstruct our annotation.

Table 2. Overview of Number Classes

Name of Sub-Type          Examples                                             Explanation
date.period               for 3 hours, two decades                             time/date period, start and end point not specified
date.regular              weekdays 10 am to 6 pm                               expressions like opening hours etc.
date.time                 at around 11 o’clock                                 common time expressions
date.time.period          6-10 am                                              duration, start and end specified
date.time.relative        in two hours                                         relative specification, tie: e.g. now
date.time.complete        17:40:34                                             time stamp
date.date                 October 5                                            common date expressions
date.date.period          November 22-29, Wednesday to Friday, 1998/1990       duration, start and end specified
date.date.relative        next month, in three days                            relative specification, tie: e.g. today
date.date.complete        July 21, 1991                                        complete date
date.date.day             on Monday                                            all weekdays
date.date.month           last November                                        all months
date.date.year            1993                                                 year specification
number.amount             4 books, several thousand spectators                 count, number of items
number.amount.age         aged twenty, Peter (27)                              age
number.amount.money       1 Mio Euros, 1,40                                    monetary amount
number.amount.complex     40 children per year                                 complex counts
number.measure            18 degrees Celsius                                   measurements not covered otherwise
number.measure.area       30.000 acres                                         specification of area
number.measure.speed      30 mph                                               specification of speed
number.measure.length     100 km bee-line, 10 meters                           specification of length, altitude, ...
number.measure.volume     43,7 l of rainfall, 230.000 cubic meters of water    specification of capacity
number.measure.weight     52 kg sterling silver, 3600 barrel                   specification of weight
number.measure.complex    67 l per square mile, 30x90x45 cm                    complex measurement
number.percent            32 %, 50 to 60 percent                               percentage
number.phone              069-848436                                           phone number
itemization.rank          third rank                                           ranking e.g. in competition
itemization.score         9 points, 23:26 goals                                score e.g. in tournament
formula.variables         n cos(x)                                             generic equations
formula.parameters        y^4A32*x-'                                           specific equations



• Ambiguities: In some cases we needed a very large context window to
disambiguate the expressions. Additionally, we even found examples
which we could not disambiguate at all, e.g. über 3 Jahre with the possible
translations more than 3 years or for 3 years. In German such structures are typically
disambiguated by prosody.

• Particular text type: A comparison between CoNLL and the corpora we used to
develop our guidelines showed that the CoNLL data might follow a very particular
style. We also had the impression that the CoNLL training and test data differ with
respect to type distribution and style. We therefore based our experiments on the
complete data and performed cross-validation.








We think that the thus annotated corpora represent a valuable resource, especially, 
given the well-known data sparseness for German. 



Table 3. Extract of the Annotated CoNLL 2003 Data

Am                am                APPRART    I-PC    date.date.complete
14.               14.               ADJA       I-NC    date.date.complete
August            August            NN         I-NC    date.date.complete
1922              @card@            CARD       I-NC    date.date.complete
rief              rufen             VVFIN      I-VC    O
er                er                PPER       I-NC    O
den               d                 ART        B-NC    O
katholischen      katholisch        ADJA       I-NC    O
Gesellenverein    Gesellenverein    NN         I-NC    O
ins               ins               APPRART    I-PC    O
Leben             Leben             NN         I-NC    O
.                 .                 $.         O       O
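
To show how such five-column data can be consumed, the following sketch reads a few rows of the extract and merges consecutive tokens that carry the same class into one number expression. The helper names are hypothetical, the lemma of 1922 is read as @card@ (an assumption about the garbled cell), and merging by identical adjacent tags is a simplification that would fuse two directly adjacent expressions of the same class.

```python
from itertools import groupby

# A few rows in the column order token, lemma, part-of-speech, chunk, annotation.
ROWS = """\
Am am APPRART I-PC date.date.complete
14. 14. ADJA I-NC date.date.complete
August August NN I-NC date.date.complete
1922 @card@ CARD I-NC date.date.complete
rief rufen VVFIN I-VC O
er er PPER I-NC O
"""


def read_rows(text):
    """Split whitespace-separated columns, skipping empty lines."""
    return [tuple(line.split()) for line in text.splitlines() if line.strip()]


def number_expressions(rows):
    """Group consecutive tokens with the same non-O annotation into spans."""
    spans = []
    for tag, group in groupby(rows, key=lambda row: row[4]):
        if tag != "O":
            spans.append((" ".join(row[0] for row in group), tag))
    return spans


print(number_expressions(read_rows(ROWS)))
```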



Furthermore, our findings during the annotation process again emphasized the
need for an integrated concept of number expressions and Named Entities: we found
467 isPartofNE items, which are extremely difficult to classify without any hint
about proper names in the context window.



3 Experimental evaluation 

3.1 Features 

Our features (see table 4 for details) are adapted from those reported in previous work 
on Named Entity Recognition (e.g. Bikel et al. (1997), Carreras et al. (2003)). We 
based the extraction on a very simple and fast analysis of the tokens combined with 
shallow grammatical clues. To additionally capture information about the context we 
used a sliding window of five tokens (the word itself, the previous two, the following
two).

3.2 Classifiers 

To get a feeling for the expected performance, we conducted a preliminary test
by experimenting with Weka (Witten et al. (2005)). For this purpose we ran the
Weka implementations of a Decision Tree, k-Nearest Neighbor, and Naive Bayes
algorithm with the standard settings and no preprocessing or tuning. Because of
previous, promising experiences with AdaBoost (Carreras et al. (2003)) and Maximum
Entropy in similar tasks, we decided to also apply these two classifiers. We used the
maxent implementation of the Maximum Entropy algorithm⁴. For the experiments
with AdaBoost we used our own C++ implementation, which we tuned for large
sparse feature vectors with binary entries.

⁴ http://www2.nict.go.jp/x/x161/members/mutiyama/software.html
Table 4. Overview of Features Used

feature group                features
only digit strings           2-digit integer [30-99], other 2-digit integer,
                             4-digit integer [1000-2100], other 4-digit integer,
                             other integer
digit and non-digit strings  1-digit or 2-digit followed by point, 4-digit with
                             central point or colon, any digit sequence with
                             point, colon, comma, comma and point, hyphen,
                             slash, or other non-digit character
non-digit strings            any character sequence max length 3, any character
                             sequence followed by point, any character sequence
                             with slash, any character sequence
grammar                      part-of-speech tag, lemma
window                       all features mentioned above for window +/-2
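As an illustration of the first feature group in Table 4, the following Python sketch maps a token to one of the "only digit strings" features. The function name and the returned labels are hypothetical, and treating the length of the digit string (rather than its numeric value) as decisive is an assumption.

```python
import re


def digit_string_feature(token: str) -> str:
    """Assign a token to one of the 'only digit strings' features of Table 4."""
    if not re.fullmatch(r"[0-9]+", token):
        return "not_a_digit_string"
    value = int(token)
    if len(token) == 2:
        return "2-digit[30-99]" if 30 <= value <= 99 else "other_2-digit"
    if len(token) == 4:
        return "4-digit[1000-2100]" if 1000 <= value <= 2100 else "other_4-digit"
    return "other_integer"
```

The range [1000-2100] presumably targets year expressions; the sketch only encodes the ranges given in the table and makes no claim about why they were chosen.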



3.3 Results 

The performance of the Decision Tree, k-Nearest Neighbor, Naive Bayes, and Maximum
Entropy algorithms is on average mediocre, as Table 5 reveals. In contrast,
our AdaBoost implementation shows satisfactory or even good f-measure values for
almost all classes and thus outperforms the other classifiers in most cases.



Table 5. Overview of the F-Measure Values (AB: AdaBoost, DT: Decision Tree, KNN:
k-Nearest Neighbor, ME: Maximum Entropy, NB: Naive Bayes)

class                   AB    DT    KNN   ME    NB
other                   0.99  0.99  0.98  0.99  0.97
date                    0.37  0.13  0.21  0.24  0.19
date.date               0.67  0.73  0.67  0.74  0.09
date.date.complete      0.72  0.61  0.74  0.49  0.20
date.date.day           0.53  0.15  0.14  0.20  0.06
date.date.month         0.37  0.05  0.08  0.24  0.00
date.date.period        0.43  0.38  0.36  0.45  0.09
date.date.relative      0.54  0.36  0.16  0.59  0.00
date.date.year          0.82  0.73  0.58  0.76  0.60
date.regular            0.49  0.43  0.37  0.54  0.14
date.time               0.87  0.76  0.67  0.83  0.45
date.time.period        0.41  0.40  0.46  0.38  0.31
date.time.relative      0.38  0.02  0.07  0.00  0.00
itemization             0.21  0.28  0.23  0.17  0.12
itemization.rank        0.84  0.31  0.23  0.70  0.00
itemization.score       0.83  0.43  0.40  0.78  0.04
number                  0.64  0.00  0.08  0.00  0.00
number.amount           0.33  0.53  0.25  0.67  0.26
number.amount.age       0.62  0.28  0.14  0.45  0.02
number.amount.complex   0.09  0.00  0.00  0.00  0.00
number.amount.money     0.82  0.45  0.28  0.79  0.30
number.measure          0.22  0.16  0.00  0.17  0.00
number.measure.area     0.88  0.10  0.00  0.40  0.00
number.measure.complex  0.34  0.21  0.19  0.22  0.09
number.measure.length   0.69  0.17  0.11  0.39  0.01
number.measure.speed    0.91  0.17  0.18  0.00  0.00
number.measure.volume   0.66  0.06  0.00  0.00  0.00
number.measure.weight   0.49  0.00  0.00  0.00  0.00
number.percent          0.83  0.32  0.10  0.56  0.06
number.phone            0.96  0.85  0.89  0.95  0.65



Table 5 also shows that there are classes with a consistently poor performance,
such as number.amount.complex, number.measure, or itemization, and classes with a
consistently good performance, such as number.phone or date.date.year. We think
that this correlates with the amount of data as well as the heterogeneity of the classes.
For instance, number.measure and itemization items do occur frequently in
the corpus, but these two classes are, according to our definition, 'garbage
collectors' and therefore much less homogeneous. In contrast, there are classes, such as
date.time.period or date.regular, with rather low f-measure values but a very
precise definition; we suspect that the annotation of these types in our
corpora might be inconsistent or inaccurate. We also suppose that there are number
expressions which exhibit an exceedingly large variety of phrasing. These are
inherently difficult to learn if the data do not provide sufficient coverage.



Table 6. Overview of the Precision Values (AB: AdaBoost)

class                   AB
other                   0.98
date                    0.61
date.date               0.75
date.date.complete      0.79
date.date.day           0.83
date.date.month         0.79
date.date.year          0.85
date.date.relative      0.73
date.date.period        0.65
date.regular            0.68
date.time               0.88
date.time.period        0.54
date.time.relative      0.50
itemization             0.34
itemization.rank        0.88
itemization.score       0.91
number                  0.81
number.amount           0.48
number.amount.age       0.79
number.amount.money     0.89
number.amount.complex   0.39
number.percent          0.87
number.phone            0.96
number.measure          0.70
number.measure.area     0.93
number.measure.length   0.85
number.measure.speed    0.94
number.measure.volume   0.76
number.measure.weight   0.56
number.measure.complex  0.65

Fortunately, there are a number of classes with a fairly high f-measure value
(that is, more than 0.8) for at least one of the five classifiers, e.g. date.date.year,
itemization.rank, and number.phone. More importantly, there are, as Table 6
shows, only six classes with a precision value of less than 0.6. We are therefore
confident that we can successfully integrate the AdaBoost implementation of our
number extraction component into a Web-based open domain Question Answering
System, since in a Web-based framework the focus tends to be on precision rather
than coverage or recall.
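For reference, per-class precision, recall, and f-measure values such as those in Tables 5 and 6 can be computed along the following lines. This is a minimal sketch that assumes token-level evaluation with exactly one gold and one predicted label per token and the balanced f-measure; the function name is hypothetical and the evaluation unit used in the experiments is not specified here.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_class_scores(gold: List[str], predicted: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f_measure)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in set(gold) | set(predicted):
        denom_p = tp[label] + fp[label]
        denom_r = tp[label] + fn[label]
        precision = tp[label] / denom_p if denom_p else 0.0
        recall = tp[label] / denom_r if denom_r else 0.0
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f_measure)
    return scores
```

A class can thus combine high precision with a modest f-measure when its recall is low, which is why Table 6 is reported separately from Table 5.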



4 Conclusions and future work 

We presented a novel, extended number classification and developed guidelines to
annotate a German newspaper corpus accordingly. On the basis of our annotated data
we trained and tested five classification algorithms to automatically extract and
classify number expressions, with promising evaluation results. However, the accuracy
is still low for some classes, especially for the small or heterogeneous ones.
Nevertheless, we are confident that we can improve our system by incorporating
selected training data, especially in the case of small classes. To find the weak
points in our system, we plan to perform a detailed
analysis of all number types and their precision, recall, and f-measure values. We
also consider a revision of our annotation, because there still might be inconsistently
and inaccurately annotated sections in the corpus. As mentioned above, the CoNLL
2003 data exhibit a typical newspaper style, which might limit the applicability of
our system to particular corpus types (although initial experiments with Web data
do not support this skepticism). We therefore intend to augment our training data
with Web texts annotated according to our guidelines. In addition, we plan to
experiment with an expanded feature set and several pre-processing methods such as
feature selection and normalization. Research in the area of Named Entity extraction
shows that multiple classifier systems or the concept of multi-view learning might be
especially effective in our application. We therefore plan to investigate several
classifier combinations and also to take a hybrid approach, combining grammar rules
and statistical methods, into account. We plan to integrate our number extraction system
into a Web-based open domain Question Answering system for German and hope to
improve the coverage and performance of the answer types processed. While there
is still room for improvement, we think that, considering the complexity of our task,
the achieved performance is surprisingly good.
Abstract. We propose benchmarking users’ navigation patterns for the evaluation of
non-profit Web portal success and apply multiple-criteria decision analysis (MCDA) for
this task. Benchmarking provides a potential for success level estimation, identification
of best practices, and improvement. MCDA enables consistent preference decision making
on a set of alternatives (i. e. portals) with regard to the multiple decision criteria and
the specific preferences of the decision maker (i. e. portal provider). We apply our
method to non-profit portals and discuss the results.



1 Introduction 

Portals within an integrated environment provide users with information, links to 
information sources, services, and productivity and community supporting features 
(e. g., email, calendar, groupware, and forum). Portals can be classified according 
to their main purpose into, e. g., community portals, business or market portals, or 
information portals. In this paper we focus on non-profit information portals. 

Usage of non-profit portals is generally free of charge. Nevertheless, operating them
incurs costs. This makes success evaluation an important task in order to optimize the
service quality given usually limited resources. The interesting questions are: (1) what
methods and criteria should be applied for success measurement, and (2) what kind of
evaluation referent should be employed for the interpretation of results. Simple usage
statistics, usage metrics (indicators) as well as navigation pattern analysis have been
proposed for such a task, usually within the framework of a goal-centered evaluation
or an evaluation of improvement relative to past performance. Goal-centered
evaluation, however, requires knowledge of desired performance levels. Defining such
levels in the context of non-profit portal usage may be a difficult task due to lack of
knowledge or experience. For instance, how often does a page have to be requested in
order to be considered successful? On the other hand, evaluation of improvement
is incomplete because it does not provide information about the success level at all.
Benchmarking, in contrast, does not require definition of performance levels in
advance. Furthermore, it has proved suitable for success level estimation,
identification of best practices, and improvement (Elmuti and Kathawala (1997)).

We present our approach to non-profit information portal success evaluation,
based on benchmarking usage patterns from several similar portals by applying
MCDA. The applied measurement criteria are not based on common e-commerce
customer lifecycle measures such as acquisition or conversion rates (Cutler and Sterne
(2000)). Thus the criteria are especially suitable for (but not limited to) the analysis
of portals that offer their contents for free and without the need for users to register.
At such portals it is often difficult to track customer relationships over several
sessions due to anonymity or privacy directives. This is a common case with non-profit
portals.

The paper is organized as follows: in Section 2 we give a brief overview of
related work. In Section 3 the method is described. In Section 4 we present a case
study and discuss the results. Section 5 contains some conclusions.



2 Related work 

Existing usage analysis approaches can be divided into three groups: analysis of (1) 
simple traffic- and time-based statistics, (2) session based metrics and patterns, and 
(3) sequential usage patterns. 

Simple statistics (Hightower et al. (1998)) are, for instance, the number of hits 
for a certain period or for a certain page. However, those figures are of limited use be- 
cause they do not contain information about dependencies between a user’s requests 
during one visit (i. e. session). 

Session based metrics are applied in particular for commercial site usage, e. g.,
customer acquisition and conversion rates (Berthon et al. (1996), Cutler and Sterne
(2000)) or micro conversion rates such as click-to-basket and basket-to-buy (Lee et
al. (1999)). Data mining methods can deliver interesting information about patterns
and dependencies between page requests. For example, association rule mining may
uncover pages which are most commonly requested together in the users’ sessions
(Srivastava et al. (2000)). Session based analysis with metrics and data mining gives
good insight into dependencies between page requests. What is missing is the
explicit analysis of the users’ sequences of page requests.

With sequential analysis the traversal paths of users can be analyzed in detail
and insights gained about usage patterns, such as “Over which paths do users get from
page A to B?”. Thus “problematic” paths and pages can be identified (Berendt and
Spiliopoulou (2000)).

The quality of the interpretation of results depends considerably on the
employed evaluation referent. For commercial sites existing market figures can be used.
For non-profit portals such “market” figures in general do not exist. Alternative
approaches have been proposed: Berthon et al. (1996) suggest interpreting measurement
results w. r. t. the goals of the respective provider. However, this implies that the
provider itself is able to specify realistic goals. Berendt and Spiliopoulou (2000) measure
success by comparing usage patterns of different user groups and by analyzing
performance outcomes relative to past performance. While this is suitable for the
identification of a site’s weak points and for its improvement, neither the overall success
level of the site nor the necessity for improvement can be estimated in this way.
Hightower et al. (1998) propose a comparative analysis of usage among similar Web sites
based on a simple statistical analysis. As already mentioned above, simple statistics
alone are of limited information value.



3 Method 

Our goal is to measure a portal’s success in providing information content pages.
Moreover, we want to identify weak points and find possibilities for improvement.
The applied benchmarking criteria are based on sequential usage pattern analysis.
Our approach consists of three steps: (1) preprocessing the page requests, (2) defining
the measurement criteria, and (3) developing the MCDA model.

3.1 Preprocessing page requests 

Big portals, especially those with highly dynamic content, can contain many
thousands of pages. In general, for such portals usage patterns at the individual page level
do not occur frequently enough for the identification of interesting patterns.
Therefore the single page requests as defined by their URI in the log are mapped to a
predefined concept hierarchy, and the dependencies between concepts are analyzed.
Various types of concept hierarchies can be defined, e.g., based on content, service,
or page type (Berendt and Spiliopoulou (2000), Spiliopoulou and Pohle (2001)). We
define a concept hierarchy based on page types (Fig. 1).* The page requests are then
mapped (i. e. classified) according to their URI if possible. If the URI does not
contain sufficient information, the text to link ratios of the corresponding portal pages are
analyzed and the requests are mapped accordingly.^ Homepage requests are mapped
to concept H, all other requests are mapped according to the descriptions in Table 1.

3.2 Measurement criteria 

We concentrate on the part of the navigation paths between the first request for page
type H (homepage) and the first subsequent request for a target page type from the
set TP = {M, MNI, MNINE}. Of interest is whether or not users navigating from the
homepage reach those target pages, and what their traversal paths look like.
Sequential usage pattern analysis (Berendt and Spiliopoulou (2000), Spiliopoulou and Pohle
(2001)) is applied.



* The page type definitions are partly adapted from Cooley et al. (1999).

^ For this purpose a training set is created manually by the expert and then analyzed by a
classification learning algorithm.
Fig. 1. Page type based concept hierarchy.



A log portion is a set $S = \{s_1, s_2, \ldots, s_N\}$ of sessions. A session is a set
$s = \{r_1, r_2, \ldots, r_L\}$ of page requests. All sessions $s \in S$ containing at
least one request which is related to concept H, denoted as $con(r_i) = H$ for
$i = 1, \ldots, L$, are of interest. These sessions are termed active sessions:
$S^{act} = \{s \in S \mid con(r_i) = H \wedge r_i \in s\}$.

Let $seq(s) = \langle a_1 a_2 \ldots a_n \rangle$ denote the sequence, i. e. an ordered
list, of all page requests in session $s$. Then $sseq = \langle b_1 b_2 \ldots b_m \rangle$
is a subsequence of $seq(s)$, denoted as $sseq \sqsubseteq seq(s)$, iff there exists an
$i$ such that $a_{i+j} \in s$ and $a_{i+j} = b_j$, $\forall j = 1, \ldots, m$. The
subsequence of a user’s clickpath in a session which is of interest starts with
$con(b_1) = H'$ and ends with $con(b_m) = p$, with
$m = \min_{i=1,2,\ldots,n}\{i \mid con(r_i) = p \wedge p \in TP\}$. $H'$ denotes the first
occurrence of H in $seq(s)$, and $p$ is the first subsequent occurrence of a request for a
target page type from the set $TP$. We denote this subsequence
$\langle b_1 \ldots b_m \rangle$ as $H' * p$, where $*$ is a wildcard for all in-between
requests. We want to analyze navigation based usage patterns only. Thus all sessions
containing $H' * p$ with requests for page types L and S not part of the sequence are of
interest. These sessions are called positive sessions w. r. t. the considered target page
type $p$:
$S^{pos}_p = \{s \in S^{act} \mid H' * p \sqsubseteq seq(s) \wedge con(r_i) \notin \{L, S\}\}, \forall r_i \in H' * p$.

Definition 1. The effectiveness of requests for a page of type $p$ over all active
sessions is defined by

$eff(p) = \frac{|S^{pos}_p|}{|S^{act}|}$    (1)

The effectiveness ratio shows in what share of the active sessions requests for a page
of type $p$ occur. A low value may indicate a problem with those pages.

Definition 2. Let $length(H', p)_s$ denote the length of a sequence $H' * p$ in
$s \in S^{pos}_p$, given by the number of its non-$H'$ elements. Then the efficiency of
requests for a page of type $p$ over all respective positive sessions is defined by

$efc(p) = \frac{|S^{pos}_p|}{\sum_{s \in S^{pos}_p} length(H', p)_s}$    (2)

The efficiency ratio is the reciprocal of the average number of pages requested in the
positive sessions up to and including the first request for a target page of type $p$.
A low value stands for long click paths on average, which in turn may indicate that
users have problems reaching those pages.
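The two ratios can be sketched in Python as follows. The sketch assumes that every session has already been mapped to a list of concept labels ("H", "M", "NI", and so on) and reads the path $H' * p$ as running from the first H request to the first subsequent request of the considered type $p$; the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

SEARCH_CONCEPTS = {"L", "S"}  # search result and search service pages


def path_to_target(session: Sequence[str], target: str) -> Optional[List[str]]:
    """Path from the first H request to the first later request of type `target`."""
    if "H" not in session:
        return None
    start = list(session).index("H")
    for end in range(start + 1, len(session)):
        if session[end] == target:
            return list(session[start:end + 1])
    return None


def effectiveness_and_efficiency(sessions: List[Sequence[str]], target: str) -> Tuple[float, float]:
    """Return (eff, efc) for one target page type."""
    active = [s for s in sessions if "H" in s]
    if not active:
        return 0.0, 0.0
    paths = [path_to_target(s, target) for s in active]
    # Positive sessions: target reached and no search pages on the path.
    positive = [p for p in paths if p is not None and not SEARCH_CONCEPTS & set(p)]
    eff = len(positive) / len(active)
    total_length = sum(len(p) - 1 for p in positive)  # exclude the leading H'
    efc = len(positive) / total_length if total_length else 0.0
    return eff, efc
```

For example, with the sessions H-NI-M, H-M, H-S-L-M, NI-M and H-NI-NI and target M, four sessions are active, two are positive with path lengths 2 and 1, so eff is 0.5 and efc is 2/3.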
3.3 Development of the MCDA model

The MCDA method applied is Simple Additive Weighting (SAW). SAW is
suitable for decision problems with multiple alternatives based on multiple (usually
conflicting) criteria. It allows consistent preference decision making on a set
$A = \{a_1, a_2, \ldots, a_s\}$ of alternatives, a set $C = \{c_1, c_2, \ldots, c_l\}$ of
criteria and their corresponding weights $W = \{w_1, w_2, \ldots, w_l\}$
($\sum_{j=1}^{l} w_j = 1$). The latter reflect the decision maker’s preference for each
criterion. SAW aggregates the criteria $c_j$ based outcome values $x_{ij}$ for an
alternative $a_i$ into an overall utility score $U^{SAW}(a_i)$. The goal is
to obtain a ranking of the alternatives according to their utility scores. First, the
outcome values $x_{ij}$ are normalized to the interval [0,1] by applying a value function
$u_j(x_{ij})$. Then the utility score for each alternative is derived by

$U^{SAW}(a_i) = \sum_{j=1}^{l} w_j \cdot u_j(x_{ij}), \forall a_i \in A$.

For SAW the criteria based outcome values must be at
least of an ordinal scale and the decision maker’s preference order relation on them
must be complete and transitive. For a more detailed introduction we refer to Figueira
et al. (2005), Lenz and Ablovatski (2006).



Table 1. Page types

Concept      Page Type         Purpose                                Characteristics
H            Head              Entry page for the considered portal   Topmost page of the focused site
                               area                                   hierarchy or sub-hierarchy
M            Media             Provides information content           High text to link ratio
                               represented by some form of media
                               such as text or graphics
NI, NINE     Navigation        Provides links to site internal (NI)   Small text to link ratio
                               or to site internal and external
                               (NINE) targets
MNI, MNINE   Media/Navigation  Provides some (introductory)           Medium text to link ratio
                               information and links to further
                               information sources
S            Search Service    Provides search service                Contains a search form
L            Search Result     Provides search results                Contains a result list



We use $eff(p)$ and $efc(p)$ as measurement criteria (see Fig. 2) for the portal
success evaluation. Within this context we give the following definition of portal
success:

Definition 3. The success level of a non-profit portal in providing information
content pages w. r. t. the chosen criteria and weights is determined by its utility score
relative to the utility scores of all other considered portals $a \in A$.

According to Definition 3 the portal with the highest utility score, denoted as $a^*$,
is the most successful: $U^{SAW}(a^*) > U^{SAW}(a_i)$ with $a^* \neq a_i$ and
$a^*, a_i \in A$.
4 Case study



The proposed approach is applied to a case study of three German eGovernment
portals. Each portal belongs to a different German state. Their contents and services are
mainly related to state specific topics about schools, education, educational policy
etc. One of the main target user groups is teachers.

Preprocessed^ log data from November 15th and 19th of 2006 from each server
are analyzed. The numbers of active sessions in the respective log portions are 746
for portal 1, 2168 for portal 2, and 4692 for portal 3. The obtained decision matrix
is shown in Fig. 2. The main decision criteria are the $p$ requests with the
subcriteria $eff(p)$ and $efc(p)$. The corresponding utility score function for this
two-level structure of criteria is

$U^{SAW}(a_i) = \sum_{j=1}^{l} w_j \left[ \sum_{k} w_{jk} \cdot u_{jk}(x_{ijk}) \right], \forall a_i \in A$.





            M (0.33)                MNI (0.33)              MNINE (0.33)            U^SAW
            eff (0.83)  efc (0.17)  eff (0.83)  efc (0.17)  eff (0.83)  efc (0.17)
Portal 1    0.1126      0.2054      0.2815      0.3518      0.6408      0.7685      0.36
Portal 2    0.1425      0.2050      0.1836      0.2079      0.1965      0.2338      0.18
Portal 3    0.0058      0.2455      0.0254      0.2459      0.3382      0.4175      0.15

Fig. 2. Decision matrix (with weights in brackets)
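The two-level SAW aggregation can be retraced with a short Python sketch on the values of the decision matrix. It rests on two assumptions: the criterion weight 0.33 is read as a rounded 1/3, and the eff and efc values, which already lie in [0,1], are used directly as normalized outcomes (identity value function). The variable and function names are hypothetical.

```python
from typing import Dict

CRITERIA_WEIGHTS = {"M": 1 / 3, "MNI": 1 / 3, "MNINE": 1 / 3}  # assumed exact thirds
SUBCRITERIA_WEIGHTS = {"eff": 0.83, "efc": 0.17}

DECISION_MATRIX = {
    "Portal 1": {"M": {"eff": 0.1126, "efc": 0.2054},
                 "MNI": {"eff": 0.2815, "efc": 0.3518},
                 "MNINE": {"eff": 0.6408, "efc": 0.7685}},
    "Portal 2": {"M": {"eff": 0.1425, "efc": 0.2050},
                 "MNI": {"eff": 0.1836, "efc": 0.2079},
                 "MNINE": {"eff": 0.1965, "efc": 0.2338}},
    "Portal 3": {"M": {"eff": 0.0058, "efc": 0.2455},
                 "MNI": {"eff": 0.0254, "efc": 0.2459},
                 "MNINE": {"eff": 0.3382, "efc": 0.4175}},
}


def utility(outcomes: Dict[str, Dict[str, float]]) -> float:
    """Weighted sum over criteria of the weighted sum over their subcriteria."""
    return sum(
        w_j * sum(w_k * outcomes[criterion][sub] for sub, w_k in SUBCRITERIA_WEIGHTS.items())
        for criterion, w_j in CRITERIA_WEIGHTS.items()
    )


scores = {portal: utility(values) for portal, values in DECISION_MATRIX.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions the scores round to 0.36, 0.18 and 0.15, in agreement with the last column of the decision matrix, and the ranking is portal 1, portal 2, portal 3.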



The interpretation of results is carried out from the perspective of portal provider
2 (denoted as p2). Thus, the weights are set according to the preferences of p2. As can
be seen from the decision matrix (Fig. 2), M, MNI, and MNINE requests are equally
important to p2. However, effectiveness of requests is considerably more important
to p2 than efficiency, i. e. it is more important that users find (request) the pages at
all than that they do so via the shortest possible paths.

The results show that portal 1 exhibits a superior overall performance over the
two others. According to Definition 3 portal 1 is clearly the most successful w. r. t.
the considered criteria and weights. Several problems for portal 2 can be identified.
Efficiency values for MNI and MNINE requests are lower (i. e. the users’ clickpaths
are longer) than for the two other portals. The effectiveness value of MNINE requests
is the lowest. This indicates a problem with those pages. As a first step towards
identifying possible causes we apply the sequence mining tool WUM (Spiliopoulou
and Faulstich (1998)) for visualizing the usage patterns containing MNINE requests.
The results show that those patterns contain many NI and NINE requests in between.
A statistical analysis of consecutive NINE requests confirms these findings. As
can be seen from Fig. 3, the percentage frequency n(X = x)/N · 100 (for x = 1, 2, ..., 5)
of sessions with one or several consecutive NINE requests is considerably higher for
portal 2. Finally, a manual inspection of the portal’s pages uncovers many navigation
pages (NI, NINE) containing only very few links and nothing else. Such pages are
the cause of a deep and somewhat “too complicated” hierarchical structure of the
portal site, which might cause users to abandon it before reaching any MNINE page.
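The percentage frequency n(X = x)/N · 100 can be computed as in the following Python sketch. It assumes that X is the length of the longest run of consecutive NINE requests in a session and that N is the number of sessions considered; both readings, as well as the function names, are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, List, Sequence


def longest_run(session: Sequence[str], concept: str = "NINE") -> int:
    """Length of the longest run of consecutive requests for `concept`."""
    runs = (len(list(group)) for key, group in groupby(session) if key == concept)
    return max(runs, default=0)


def run_distribution(sessions: List[Sequence[str]], max_length: int = 5) -> Dict[int, float]:
    """Percentage of sessions whose longest NINE run has length x, for x = 1..max_length."""
    total = len(sessions)
    if total == 0:
        return {x: 0.0 for x in range(1, max_length + 1)}
    counts = Counter(longest_run(s) for s in sessions)
    return {x: 100.0 * counts[x] / total for x in range(1, max_length + 1)}
```

Comparing these distributions across portals shows whether one portal forces its users through noticeably longer chains of pure navigation pages.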

^ For a detailed description on preprocessing log data refer to Cooley et al. (1999). 
Fig. 3. NINE request distribution (legend: Portal 1, Portal 2, Portal 3)



We recommend that p2 flatten the hierarchical structure by reducing the number
of NI, NINE pages by, e. g., merging several consecutive “few-link” NI, NINE pages
into one page where possible. Another solution could be to use more MNINE pages
for navigation purposes instead of NINE pages (which is the quite successful strategy
of portal 1).



5 Conclusions 

A multi-criteria decision model for success evaluation of information providing
portals based on the users’ navigation patterns is proposed. The objective is to estimate
a portal’s performance, identify weak points, and derive possible approaches for
improvement. The model allows a systematic comparative analysis of the considered
portal alternatives on the basis of the decision maker’s preferences. Furthermore, the
model is very flexible. Criteria can be added or excluded according to the
evaluation task at hand. In practice, this approach can be a useful tool that helps a portal
provider to evaluate and improve its success, especially in areas where no common
“market figures” or other success benchmarks exist.

However, a prerequisite for this approach is the existence of other similar portals 
which can serve as benchmarks. This is a limiting factor, since (1) there simply may 
not exist similar portals or (2) other providers are not willing (e. g., due to competi- 
tion) or able (e. g., due to capacity) to cooperate. 

Future research will include the analysis of patterns with more than one target 
page type request in one session. We also plan to analyze and compare the users’ 
search behavior to get hints on the quality of the portals’ search engines. Finally, the 
usage based MCDA model will be extended by a survey to incorporate user opinions. 
Abstract. Within the last decade text mining, i.e., extracting sensitive information from text
corpora, has become a major factor in business intelligence. The automated textual analysis of
law corpora is highly valuable because of its impact on a company’s legal options and the raw
amount of available jurisdiction. The study of supreme court jurisdiction and international law
corpora is equally important due to its effects on business sectors.

In this paper we use text mining methods to investigate Austrian supreme administra- 
tive court jurisdictions concerning dues and taxes. We analyze the law corpora using R with 
the new text mining package tm. Applications include clustering the jurisdiction documents 
into groups modeling tax classes (like income or value-added tax) and identifying jurisdiction 
properties. The findings are compared to results obtained by law experts. 



1 Introduction 

A thorough discussion and investigation of existing jurisdictions is a fundamental ac- 
tivity of law experts since convictions provide insight into the interpretation of legal 
statutes by supreme courts. On the other hand, text mining has become an effective 
tool for analyzing text documents in automated ways. Conceptually, clustering and 
classification of jurisdictions as well as identifying patterns in law corpora are of key 
interest since they aid law experts in their analyses. E.g., clustering of primary and 
secondary law documents as well as actual law firm data has been investigated by 
Conrad et al. (2005). Schweighofer (1999) has conducted research on automatic text 
analysis of international law. 

In this paper we use text mining methods to investigate Austrian supreme
administrative court jurisdictions concerning dues and taxes. The data is described in
Section 2 and analyzed in Section 3. Results of applying clustering and
classification techniques are compared to those found by tax law experts. We also propose
a method for automatic feature extraction (e.g., of the senate size) from Austrian
supreme court jurisdictions. Section 4 concludes.
2 Administrative Supreme Court jurisdictions

2.1 Data 

The data set for our text mining investigations consists of 994 text documents. Each
document contains a jurisdiction of the Austrian supreme administrative court
(Verwaltungsgerichtshof, VwGH) in German. Documents were obtained
through the legal information system (Rechtsinformationssystem, RIS;
http://ris.bka.gv.at/) coordinated by the Austrian Federal Chancellery. Unfortunately,
documents delivered through the RIS interface are HTML documents oriented towards browser
viewing and possess no explicit metadata describing additional jurisdiction details
(e.g., the senate with its judges or the date of decision). The data set corresponds to
a subset of about 1000 documents of material used for the research project “Analyse
der abgabenrechtlichen Rechtsprechung des Verwaltungsgerichtshofes” supported
by a grant from the Jubiläumsfonds of the Austrian National Bank (Oesterreichische
Nationalbank, OeNB), see Nagel and Mamut (2006). Based on the work of Achatz
et al. (1987), who analyzed tax law jurisdictions in the 1980s, this project investigates
whether and how results and trends found by Achatz et al. compare to jurisdictions
between 2000 and 2004, giving insight into legal norm changes and their effects
and unveiling information on the quality of executive and juristic authorities. In the
course of the project, jurisdictions especially related to dues (e.g., on a federal or
communal level) and taxes (e.g., income, value-added or corporate taxes) were
classified by human tax law experts. These classifications will be employed for validating
the results of our text mining analyses.

2.2 Data preparation 

We use the open source software environment R for statistical computing and graph- 
ics, in combination with the R text mining package tm to conduct our text mining ex- 
periments. R provides premier methods for clustering and classification whereas tm 
provides a sophisticated framework for text mining applications, offering functional- 
ity for managing text documents, abstracting the process of document manipulation 
and easing the usage of heterogeneous text formats. 

Technically, the jurisdiction documents in HTML format were downloaded 
through the RIS interface. To work with this inhomogeneous set of malformed HTML 
documents, HTML tags and unnecessary white space were removed resulting in 
plain text documents. We wrote a custom parsing function to handle the automatic 
import into tm’s infrastructure and extract basic document metadata (like the file 
number). 



3 Investigations 

3.1 Grouping the jurisdiction documents into tax classes 

When working with larger collections of documents it is useful to group these into
clusters in order to provide homogeneous document sets for further investigation by
experts specialized in relevant topics. Thus, we investigate different methods known
in the text mining literature and compare their results with the results found by law
experts.

k-means Clustering 

We start with the well-known k-means clustering method on term-document matrices.
Let $tf_{t,d}$ be the frequency of term $t$ in document $d$, $m$ the number of
documents, and $df_t$ the number of documents containing the term $t$. Term-document
matrices $M$ with respective entries $\omega_{t,d}$ are obtained by suitably weighting
the term-document frequencies. The most popular weighting schemes are Term Frequency
(tf), where $\omega_{t,d} = tf_{t,d}$, and Term Frequency Inverse Document Frequency
(tf-idf), with $\omega_{t,d} = tf_{t,d} \log_2(m / df_t)$, which reduces the impact of
irrelevant terms and highlights discriminative ones by normalizing each matrix element
under consideration of the number of all documents. We use both weightings in our tests.
In addition, text corpora were stemmed before computing term-document matrices via the
Rstem (Temple Lang, 2006) and Snowball (Hornik, 2007) R packages which provide the
Snowball stemming (Porter, 1980) algorithm.
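
The two weighting schemes can be illustrated with a minimal sketch. It is written in
Python rather than R, and the function name `weight_matrix` as well as the toy counts are
assumptions made for illustration; it is not the implementation used for the experiments.

```python
import math

def weight_matrix(tf, scheme="tf"):
    """Weight a term-document count matrix.

    tf[t][d] is the raw frequency of term t in document d.
    """
    m = len(tf[0])  # number of documents
    weighted = []
    for row in tf:
        df = sum(1 for count in row if count > 0)  # documents containing the term
        if scheme == "tf" or df == 0:
            factor = 1.0  # plain term frequency (a term with df == 0 has only zeros)
        else:
            factor = math.log2(m / df)  # inverse document frequency
        weighted.append([count * factor for count in row])
    return weighted

counts = [
    [2, 1, 3, 1],  # term occurring in every document
    [4, 0, 0, 0],  # term occurring in one document only
]
tfidf = weight_matrix(counts, scheme="tf-idf")
```

In this toy example the term that occurs in every document receives weight zero under
tf-idf, while the term confined to a single document is emphasized.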

Domain experts typically suggest a basic partition of the documents into three 
classes (income tax, value-added tax, and other dues). Thus, we investigated the ex- 
tent to which this partition is obtained by automatic classification. We used our data 
set of about 1000 documents and performed k-means clustering, for $k \in \{2, \ldots, 10\}$.
The best results were in the range between k = 3 and k = 6 when considering the im- 
provement of the within-cluster sum of squares. These results are shown in Table 1. 
For each k, we compute the agreement between the k-means results based on the 
term-document matrices with either tf or tf-idf weighting and the expert rating into 
the basic classes, using both the Rand index (Rand) and the Rand index corrected for 
agreement by chance (cRand). Row “Average” shows the average agreement over 
the four ks. Results are almost identical for the two weightings employed. Agreements
are rather low, indicating that the “basic structure” cannot easily be captured
by straightforward term-document frequency classification.

Table 1. Rand index and Rand index corrected for agreement by chance of the contingency
tables between k-means results, for $k \in \{3, 4, 5, 6\}$, and expert ratings for tf and
tf-idf weightings.

               Rand              cRand
k            tf    tf-idf     tf    tf-idf
3           0.48    0.49     0.03    0.03
4           0.51    0.52     0.03    0.03
5           0.54    0.53     0.02    0.02
6           0.55    0.56     0.02    0.03
Average     0.52    0.52     0.02    0.03
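
For readers unfamiliar with the two agreement measures, the following Python sketch
computes the Rand index and its chance-corrected variant from two label vectors by
counting object pairs. The function name `rand_indices` and the toy labels are
illustrative assumptions; the values reported in this paper were not produced with
this code.

```python
from collections import Counter
from math import comb

def rand_indices(labels_a, labels_b):
    """Return (Rand index, Rand index corrected for chance) of two partitions."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    # Pairs of objects grouped together in both partitions, in the first, in the second.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    rand = (total_pairs + 2 * both - in_a - in_b) / total_pairs
    expected = in_a * in_b / total_pairs  # agreement expected by chance
    maximum = (in_a + in_b) / 2
    # Undefined (division by zero) for degenerate partitions, e.g. a single cluster.
    corrected = (both - expected) / (maximum - expected)
    return rand, corrected

rand, crand = rand_indices([0, 0, 1, 1], [0, 0, 1, 2])
```

The corrected index is close to 0 for partitions that agree no better than chance,
which is why the two columns of an agreement table can differ so strongly.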

We note that clustering collections of large documents like law corpora presents
formidable computational challenges due to the dimensionality of the term-document
matrices involved: even after stopword removal and stemming, our roughly 1000 documents
contained about 36000 different terms, resulting in (very sparse) matrices with
about 36 million entries. Computations took only a few minutes in our cases. Larger
datasets as found in law firms will require specialised procedures for clustering
high-dimensional data.

Keyword based Clustering 

Based on the special content of our jurisdiction dataset and the results from k-means 
clustering we developed a clustering method which we call keyword based cluster- 
ing. It is inspired by simulating the behaviour of tax law students preprocessing the 
documents for law experts. Typically the preprocessors skim over the text looking for 
discriminative terms (i.e., keywords). Basically, our method works in the same way: 
we have set up specific keywords describing each cluster (e.g., “income” or “income
tax” for the income tax cluster) and analyse the similarity of each document to the
set of keywords.

[Figure 1: mosaic plot; expert classes None, Income tax, VA tax on the axis labelled Experts.]

Fig. 1. Plot of the contingency table between the keyword based clustering results and the
expert rating.



Figure 1 shows a mosaic plot for the contingency table of cross-classifications of
keyword based clustering and expert ratings. The size of the diagonal cells (visualizing
the proportion of concordant classifications) indicates that the keyword based
clustering method works considerably better than the k-means approaches, with a
Rand index of 0.66 and a corrected Rand index of 0.32. In particular, the expert
“income tax” class is recovered perfectly.
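
The idea behind keyword based clustering can be sketched as follows. This Python fragment
is only a rough illustration: the keyword sets, the class names, the function name
`keyword_cluster` and the simple occurrence count used as similarity are all assumptions,
not the keyword lists or similarity measure used in our experiments.

```python
KEYWORDS = {  # hypothetical keyword sets, for illustration only
    "income tax": ["income tax", "income"],
    "value-added tax": ["value-added tax", "vat"],
}

def keyword_cluster(text, keywords=KEYWORDS):
    """Assign a document to the class whose keywords occur most often."""
    lowered = text.lower()
    scores = {label: sum(lowered.count(term) for term in terms)
              for label, terms in keywords.items()}
    best = max(scores, key=scores.get)
    # Documents without any keyword hit fall into a residual class.
    return best if scores[best] > 0 else "none"
```

Plain substring counting is a deliberate simplification; a realistic variant would
match on stemmed tokens so that a short keyword does not fire inside unrelated words.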

3.2 Classification of jurisdictions according to federal fiscal code regulations 

A further rewarding task for automated processing is the classification of jurisdic- 
tions into documents dealing and into documents not dealing with Austrian federal 
fiscal code regulations (Bundesabgabenordnung, BAO). 

Due to the promising results obtained with string kernels in text classification and 
text clustering (Lodhi et al., 2002; Karatzoglou and Feinerer, 2007) we performed a 
“C-svc” classification with support vector machines using a full string kernel, i.e., 
using 

$$k(x, y) = \sum_{s \in \Sigma^*} \lambda_s \, \nu_s(x) \, \nu_s(y)$$

as the kernel function $k(x, y)$ for two character sequences $x$ and $y$. We set the decay
factor $\lambda_s = 0$ for all strings $s$ with $|s| > n$, where $n$ denotes the document
lengths, to instantiate a so-called full string kernel (full string kernels are
computationally much better natured). The symbol $\Sigma^*$ is the set of all strings
(under the Kleene closure), and $\nu_s(x)$ denotes the number of occurrences of $s$ in $x$.
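
A naive evaluation of such a kernel can be sketched in Python as below. The sketch assumes
a constant decay factor for all substrings up to a maximum length and ignores the empty
string; the function names are hypothetical, and this is not how kernlab computes the
kernel internally.

```python
from collections import Counter

def substring_counts(x, max_len):
    """Count occurrences of every non-empty substring of x up to max_len characters."""
    counts = Counter()
    for start in range(len(x)):
        for end in range(start + 1, min(start + max_len, len(x)) + 1):
            counts[x[start:end]] += 1
    return counts

def string_kernel(x, y, max_len, decay=1.0):
    """Sum decay * nu_s(x) * nu_s(y) over all shared substrings s with len(s) <= max_len."""
    counts_x = substring_counts(x, max_len)
    counts_y = substring_counts(y, max_len)
    return sum(decay * counts_x[s] * counts_y[s] for s in counts_x if s in counts_y)
```

The number of substrings grows quadratically with the document length when `max_len`
is not restricted, which gives an impression of why string kernels on long legal
documents are expensive.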

For this task we used the kernlab (Karatzoglou et al., 2006; Karatzoglou et 
al., 2004) R package which supports string kernels and SVM enabled classification 
methods. We used the first 200 documents of our data set as training set and the next 
50 documents as test set. We compared the 50 resulting classifications with the expert
ratings, which indicate whether a document deals with the BAO, by constructing
a contingency table (confusion matrix). We obtained a Rand index of 0.49. After
correcting for agreement by chance, the Rand index is close to 0. We measured a
very long running time (almost one day for the training of the SVM, and about 15
minutes prediction time per document on a 2.6 GHz machine with 2 GByte RAM).

Therefore we decided to use the classical term-document matrix approach in addition
to string kernels. We performed the same set of tests with tf and tf-idf weighting,
where we used the first 200 rows (i.e., entries in the matrix representing documents)
as training set and the next 50 rows as test set.



Table 2. Rand index and Rand index corrected for agreement by chance of the contingency
tables between SVM classification results and expert ratings for documents under federal fiscal
code regulations.

           tf    tf-idf
Rand      0.59    0.61
cRand     0.18    0.21



Table 2 presents the results for classifications obtained with both tf and tf-idf
weightings. We see that the results are far better than the results obtained by
employing string kernels.
These results are very promising, and indicate the great potential of employing
support vector machines for the classification of text documents obtained from
jurisdictions when term-document matrices are employed for representing the text
documents.

3.3 Deriving the senate size 



Table 3. Number of jurisdictions ordered by senate size obtained by fully automated text
mining heuristics. The percentage is compared to the percentage identified by humans.

Senate size              0         3         5        9
Documents                0       255       739        0
Percentage           0.000    25.654    74.346    0.000
Human Percentage     2.116    27.306    70.551    0.027



Jurisdictions of the Austrian supreme administrative court are obtained in so- 
called senates which can have 3, 5, or 9 members, with size indicative of the “diffi- 
culty” of the legal case to be decided. (It is also possible that no senate is formed.) An 
automated derivation of the senate size from jurisdiction documents would be highly
useful, as it would allow structural patterns to be identified both over time and across
areas. Although the formulations describing the senate members are quite standardized,
it is rather hard and time-consuming for a human to extract the senate size from
hundreds of documents because a human must read the text thoroughly to distinguish
between senate members and auxiliary personnel (e.g., a recording clerk). Thus, a fully
automated extraction would be very useful.

Since most documents contain standardized phrases regarding senate members 
(e.g., “The administrative court represented by president Dr. X and the judges Dr. 
Y and Dr. Z . . . decided . . . ”) we developed an extraction heuristic based on widely 
used phrases in the documents to extract the senate members. In detail, we investigate 
punctuation marks and copula phrases to derive the senate size. Table 3 summarizes 
the results for our data set by giving the total number of documents for senate sizes 
of zero (i.e., documents where no senate was formed, e.g., due to dismissal for want 
of form), three, five, or nine members. The table also shows the percentages and 
compares these to the aggregated percentages of the full data set, i.e., n > 1000, 
found by humans. Figure 2 visualizes the results from the contingency table between 
machine and human results in form of an agreement plot, where the observed and ex- 
pected diagonal elements are represented by superposed black and white rectangles, 
respectively. The plot indicates that the extraction heuristic works very well. This is 
supported by the very high Rand index of 0.94 and by the corrected Rand index of
0.86.
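
A heuristic of this kind can be sketched with regular expressions. The following Python
fragment is a strongly simplified illustration that works on an English paraphrase of the
standardized phrase; the pattern, the use of the title "Dr." as the marker of a senate
member, and the function name `senate_size` are assumptions and do not reproduce the
rules applied to the original documents.

```python
import re

def senate_size(text):
    """Count the members named in a standardized 'represented by ... decided' phrase."""
    match = re.search(r"represented by (.+?) decided", text, flags=re.DOTALL)
    if match is None:
        return 0  # no senate phrase found
    # Split the member list at commas and at the conjunction "and".
    parts = re.split(r",|\band\b", match.group(1))
    # Assumption: senate members carry the title "Dr.", auxiliary personnel do not.
    return sum(1 for part in parts if re.search(r"\bDr\.", part))

example = ("The administrative court represented by president Dr. X and the judges "
           "Dr. Y and Dr. Z, in the presence of recording clerk Mag. W, decided as follows.")
```

For the example phrase the function counts three members and skips the recording clerk;
real documents would need a more careful treatment of titles and auxiliary personnel.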

Further improvements could be achieved by saving identified names of judges in 
order to identify them again in other documents. Of course, ideally information such
as senate size would be provided as metadata by the legal information system, perhaps
even determined automatically by text mining methods for “most” documents
(with a per-document measure of the need for verification by humans).




[Figure 2: agreement plot; senate sizes 0, 3, 5 on the axis labelled Human.]

Fig. 2. Agreement plot of the contingency table between the senate size reported by text
mining heuristics and the senate size reported by humans.



4 Conclusion 

In this paper we have presented approaches to use text mining methods on (supreme 
administrative) court jurisdictions. We performed k-means clustering and introduced
keyword based clustering which works well for text corpora with well defined formu- 
lations as found in tax law related jurisdictions. We saw that the clustering works well 
enough to be used as a reasonable grouping for further investigation by law experts. 
Second, we investigated the classification of documents according to their relation to 
federal fiscal code regulations. We used both string kernels and term-document ma- 
trices with tf and tf-idf weighting as input for support vector machine based classifi- 
cation techniques. The experiments unveiled that employing term-document matrices 
yields both superior performance as well as fast running time. Finally, we considered 
a situation typical in working with specialized text corpora, i.e., we were looking for 
a specific property in each text corpus. In detail we derived the senate size of each ju- 
risdiction by analyzing relevant text phrases considering punctuation marks, copulas 
and regular expressions. Our results show that text mining methods can clearly aid 
legal experts to process and analyze their law document corpora, offering both con- 
siderable savings in time and cost as well as the possibility to conduct investigations 
barely possible without the availability of these methods.

Abstract. The manual acquisition and modeling of tourist information, e.g. addresses of
points of interest, is time and, therefore, cost intensive. Furthermore, the encoded information
is static and has to be refined for newly emerging sightseeing objects, restaurants or hotels.
Automatic acquisition can support and enhance the manual acquisition and can be imple- 
mented as a run-time approach to obtain information not encoded in the data or knowledge 
base of a tourist information system. In our work we apply unsupervised learning to the chal- 
lenge of web-based address extraction from plain text data extracted from web pages dealing 
with locations and containing the addresses of those. The data is processed by an unsupervised 
part-of-speech tagger (Biemann, 2006a), which constructs domain-specific categories via dis- 
tributional similarity of stop word contexts and neighboring content words. In the address 
domain, separate tags for street names, locations and other address parts can be observed. To 
extract the addresses, we apply a Conditional Random Field (CRF) on a labeled training set of 
addresses, using the unsupervised tags as features. Evaluation on a gold standard of correctly 
annotated data shows that unsupervised learning combined with state of the art machine learn- 
ing is a viable approach to support web-based information extraction, as it results in improved 
extraction quality as compared to omitting the unsupervised tagger. 



1 Introduction 

When setting up a Natural Language Processing (NLP) system for a specific domain 
or a new task, one has to face the acquisition bottleneck: creating resources such 
as word lists, extraction rules or annotated texts is expensive due to high manual 
effort. Even in times where rich resource repositories exist, these often do not con- 
tain material for very specialized tasks or for non-English languages and, therefore, 
have to be created ad-hoc whenever a new task has to be solved as a component of 
an application system. All methods that alleviate this bottleneck mean a reduction
in time and cost. Here, we demonstrate that unsupervised tagging substantially
increases performance in a setting where only limited training resources are available.
As an application, we operate on automatic address extraction from web pages for
the tourist domain.

1.1 Motivation: Address extraction from the web 

In an open-domain spoken dialog system, the automatic learning of ontological con- 
cepts and corresponding relations between them is essential as a complete manual 
modeling of them is neither practicable nor feasible due to the continuously chang- 
ing denotation of real world objects. Therefore, the emergence of new entities in the 
world entails the necessity of a method to deal with those entities in a spoken dialog 
system as described in Loos (2006). 

As a use case for this challenging problem, we imagine a user asking the dialog
system for a newly established restaurant in a city, e.g. “How do I get to the
Auerstein”. So far, the system does not have information about the object and needs the
help of an incremental learning component to be able to give the demanded answer
to the user. A classification as well as any other information for the word “Auerstein”
are hitherto not modeled in the knowledge base and can be obtained by text mining
methods as described in Faulhaber et al. (2006). As soon as the object is classified
and located in the system’s domain ontology, it can be concluded that it is a building 
and that all buildings have addresses. At this stage the herein described work comes 
into play, which deals with the extraction of addresses in unstructured text. With a 
web service (as part of the dialog system’s infrastructure) the newly found address 
for the demanded object can be used for a route instruction. 

Even though structured and semi-structured texts such as online directories can 
be harvested as well, they often do not contain addresses of new places and do, 
therefore, not cover all addresses needed. However, a search in such directories can 
be used in combination with the method described herein, which can be used as a 
fallback solution. 

1.2 Unsupervised learning supporting supervised methods 

Current research in supervised approaches to NLP often tries to reduce the amount 
of human effort required for collecting labeled examples by defining methodologies 
and algorithms that make a better use of the training set provided. Another promis- 
ing direction to tackle this problem is to empower standard learning algorithms by 
the addition of unlabeled data together with labeled texts. In the machine learning 
literature, this learning scheme has been called semi-supervised learning (Sarkar and 
Haffari, 2006). The underlying idea behind our approach is that syntactic and seman- 
tic similarity of words is an inherent property of corpora, and that it can be exploited 
to help a supervised classifier to build a better categorization hypothesis, even if the
amount of labeled training data provided for learning is very low. We emphasize 
that every contribution to widening the acquisition bottleneck is useful, as long as 
its application does not cause more extra work than the contribution is worth. Here, 
we provide a methodology to plug an unsupervised tagger into an address extraction 
system and measure its contribution.

2 Data preparation

In our semi-supervised setting, we require two different data sets: a small, manually 
annotated dataset used for training our supervised component, and a large, unanno- 
tated dataset for training the unsupervised part of the system. This section describes 
how both datasets were obtained. For both datasets we used the results of Google 
queries for places such as restaurants, cinemas, shops etc. To obtain the annotated data
set, 400 of the resulting Google pages for the addresses of the corresponding named
entities were annotated manually with the labels street, house, zip and city; all
other tokens received the label 0.

As the unsupervised learning method is in need of large amounts of data, we used 
a list with about 20,000 Google queries each returning about 10 pages to obtain an 
appropriate amount of plain text. After filtering the resulting 700 MB raw data for 
German language and applying cleaning procedures as described in (Quasthoff et al., 
2006) we ended up with about 160 MB totaling 22.7 million tokens. This corpus was 
used for training the unsupervised tagger. 



3 Unsupervised tagging 

3.1 Approach 

Unlike in standard (supervised) tagging, the unsupervised variant relies neither on a 
set of predefined categories nor on any labeled text. As a tagger is not an application
in its own right, but serves as a pre-processing step for systems building upon it, the
names and the number of syntactic categories are very often not important.

The system presented in Biemann (2006a) uses Chinese Whispers clustering 
(Biemann, 2006b) on graphs constructed by distributional similarity to induce a lex- 
icon of supposedly non-ambiguous words with respect to part of speech (PoS) by 
selecting only safe bets and excluding questionable cases from the category build- 
ing process. In this implementation two clusterings are combined, one for high and 
medium frequency words, the other collecting medium and low frequency words. 
High and medium frequency words are clustered by similarity of their stop word 
context feature vectors: a graph is built, including only words that are endpoints of 
highly similar pairs. Clustering this graph of typically 5,000 vertices results in several
hundred clusters, which are subsequently used as PoS categories. To extend the lex- 
icon, words of medium and low frequency are clustered using a graph that encodes 
similarity of significant neighbor co-occurrences (as defined in Dunning, 1993). Both
clusterings are mapped by overlapping elements into a lexicon that provides PoS in- 
formation for some 50,000 words. 

For obtaining a clustering on datasets of this size, an effective algorithm like Chi- 
nese Whispers is crucial. Increased lexicon size is the main difference between this 
and other approaches (e.g. Schütze, 1995; Freitag, 2004), which typically operate
with 5,000 words. Using the lexicon, a trigram tagger with a morphological extension
is trained, which can be used to assign tags to all tokens in a text. The tag sets
obtained with this method are usually more fine-grained than standard tag sets and
reflect syntactic as well as semantic similarity. In Biemann (2006a), the tagger output
was directly evaluated against supervised taggers for English, German and Finnish
via information-theoretic measures. While it is possible to compare the relative
performance of different components of a system or different systems along this scale,
it gives only a poor impression of the utility of the unsupervised tagger’s output.
Therefore, an application-based evaluation is undertaken here. 

3.2 Resulting tagset 

As described in Section 2, we had a relatively small corpus in comparison to previous
work with the same tagger, which typically operates on about 50 million tokens.
Nonetheless, the domain specificity of the corpus leads to an appropriate tagging,
which can be seen in the following examples from the resulting tag set (numbers
in brackets give the words in the lexicon per tag):

1. Nouns: Verhandlungen, Schritt, Organisation, Lesungen, Sicherung,... (800) 

2. Verbs: habe, lernt, wohnte, schien, hat, reicht, suchte... (191) 

3. Adjectives: französischen, künstlerischen, religiösen... (142)

4. locations: Potsdam, Passau, Innsbruck, Ludwigsburg, Jena... (320) 

5. street names: Bismarckstr, Leonrodstr, Schillerstr, Ungererstr... (150) 

On the one hand, big clusters are formed that contain syntactic tags as shown 
for the example tags 1 to 3. Items 4 and 5 show that not only syntactic tags are 
created by the clustering process, but also domain specific tags, which are useful for 
an address extraction. Note that the actual tagger is capable of tagging all words, not
only words in the lexicon - the number of words in the lexicon is merely the number
of types used for training. We emphasize that the comparatively small training corpus
(usually, 50M-500M tokens are employed) leaves room for improvements, as more
training text was shown to have a positive impact on tagging quality in previous studies.



4 Experiments and evaluation 

This section describes the supervised system, the evaluation methodology and the 
results we obtained in a comparative evaluation of either providing or not providing 
the unsupervised tags. 

4.1 Conditional random field tagger 

We perceived address extraction as a tagging task: labels indicating city, street, 
house number, zip code or other (0) from the training set are learned and applied 
to unseen examples. Note that this is not comparable to a standard task like Named 
Entity Recognition (cf. Roth and van den Bosch, 2002), since we are only interested 
in labeling the address of the target location, and not other addresses that might be
contained in the same document. Rather, this is an instance of Information Extraction
(see Grishman, 1997). For performing the task, we train the MALLET tagger (Mc- 
Callum, 2002), which is based on Conditional Random Fields (CRFs, see Lafferty 
et al. 2001). CRFs define a conditional probability distribution over label sequences 
given a particular observation sequence. CRFs have been proven to have equal or 
superior performance at tagging tasks as compared to other systems like Hidden 
Markov Models or the Maximum Entropy Framework. The flexibility of CRFs to in- 
clude arbitrary, non-independent features allows us to supply unsupervised tags or no 
tags to the system without changing the overall architecture. The tagger can operate 
on a different set of features ranging over different distances. The following features 
per instance are made available to the CRF: 

• word itself 

• relative position to target name 

• unsupervised tag 

We experimented with different orders as well as with different time shifts. 

CRF order 

The order of the CRF defines how many preceding labels are used for the determi- 
nation of the current label. An order of 1 means that only the previous label is used, 
order 2 allows for the usage of two previous labels etc. As higher orders mean more 
information, which is in turn supported by fewer training examples, an optimum at 
some small order can be expected. 

Time shifting 

Time shifting is an operation that allows the CRF to use not only the features for 
the current position, but also features from surrounding positions. This is achieved
by copying the features from surrounding positions, indicating what relative position 
they were copied from. As with orders, an optimum can be expected for some small 
range of time shifting, exhibiting the same information/sparseness trade-off. For il- 
lustration, the following listing shows an original training instance with time shift 0, 
as well as the same instance with time shifts -2, -1, 0, 1,2, for the scenario with un- 
supervised tags. Note that relative positions are not copied in time-shifting because 
of redundancy. The following items show these shifts: 

• shift 0:

- Extrablatt 0 T115 0

- 53 1 T215 house

- Hauptstr 2 T64 street

- Heidelberg 3 T15 city

- 69117 4 T215 zip

• shift 1:

- 1 -1:Extrablatt -1:T115 0:53 0:T215 1:Hauptstr 1:T64 house

- 2 -1:53 -1:T215 0:Hauptstr 0:T64 1:Heidelberg 1:T15 street

• shift 2:

- 1 -2:Cafe -2:T10 -1:Extrablatt -1:T115 0:53 0:T215 1:Hauptstr 1:T64 2:Heidelberg
2:T15 house

- 2 -2:Extrablatt -2:T115 -1:53 -1:T215 0:Hauptstr 0:T64 1:Heidelberg 1:T15 2:69117
2:T215 street

In the example for shift 0 a full address with all features is shown: word, relative 
position to target "Extrablatt", unsupervised tag and classification label. For exem- 
plifying shifts 1 and 2, only two lines are given, with -2:, -1:, 0:, 1: and 2: being 
the relative position of copied features. In the scenario without unsupervised tags all 
features "T<number>" are omitted. 
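
The copying of features described above can be sketched in a few lines of Python. The
function name `time_shift` and the tuple layout are assumptions for illustration; the
sketch prefixes every copied feature with its offset, including offset 0, and does not
reproduce the actual MALLET input preparation.

```python
def time_shift(instances, shift):
    """Copy word and tag features from neighbours within +/- shift positions.

    Each instance is a tuple (word, relative_position, tag, label).
    """
    shifted = []
    for i, (_, position, _, label) in enumerate(instances):
        features = [str(position)]  # the relative position is kept once, not copied
        for offset in range(-shift, shift + 1):
            j = i + offset
            if 0 <= j < len(instances):
                word, _, tag, _ = instances[j]
                features += [f"{offset}:{word}", f"{offset}:{tag}"]
        shifted.append(" ".join(features + [label]))
    return shifted

address = [
    ("Extrablatt", 0, "T115", "0"),
    ("53", 1, "T215", "house"),
    ("Hauptstr", 2, "T64", "street"),
    ("Heidelberg", 3, "T15", "city"),
    ("69117", 4, "T215", "zip"),
]
shift_one = time_shift(address, 1)
```

Applied to the five tokens of the example address, the second and third entries of
`shift_one` correspond to the two lines listed for shift 1. Neighbours outside the
token list are simply skipped here.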

4.2 Evaluation methodology 

For evaluation, we split the training set into 5 equisized parts and performed 5
sub-experiments per parameter setting and scenario, using 4 parts for training and the
remaining part for evaluation in a 5-fold cross-validation fashion. The split was
performed per target location: locations in the test set were never contained in the
training set. To determine our system’s performance, we measured the number of correctly
classified, incorrectly classified (false positives) and missed (false negatives)
instances per class and report the standard measures Precision, Recall and F1-measure
as described in Rijsbergen (1979). The 5 sub-experiments were combined and
checked against the full training set.
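
For reference, the three measures follow from the per-class counts as in the Python
sketch below. The function name and the example counts are illustrative assumptions
and are not taken from our experiments.

```python
def precision_recall_f1(correct, false_positives, false_negatives):
    """Compute precision, recall and F1 for one class from instance counts."""
    predicted = correct + false_positives  # instances the system assigned to the class
    actual = correct + false_negatives     # instances that belong to the class
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(correct=8, false_positives=2, false_negatives=8)
```

Because F1 is the harmonic mean of the other two measures, a moderate loss in precision
can be outweighed by a larger gain in recall.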

4.3 Results 

Our objective is to examine to what extent the unsupervised tagger influences
classification results. Conducting the experiments with different CRF parameters as
outlined in Section 4.1, we found different behaviors for our four target classes: whereas
for street and house number, results were slightly better in the second order CRF
experiments, the first order CRF scored clearly higher for city and zip code.
Restricting experiments to first order CRFs and regarding different shifts, a shift of 2
in both directions scored best for all classes except city, where both shift 0 and 1
resulted in slightly higher scores. The best overall setting, therefore, was determined
to be the first order CRF with a shift of 2. For this setting, Figure 1 presents the
results in terms of precision, recall and F1.

What can be observed not only from Figure 1 but also for all parameter settings
is the following: Using unsupervised tags as features as compared to no tagging
leads to a slightly decreased precision but a substantial increase in recall, and always
affects the F1 measure positively. The reason can be sought in the generalization
power of the tagger: having at hand syntactic-semantic tags instead of merely plain
words, the system is able to classify more instances correctly, as the tag (but not the
word) has occurred with the correct classification in the training set before. Due to
overgeneralization or tagging errors, however, precision is decreased. The effect is
strongest for street with a loss of 7% in precision with a recall boost of 14%.
In general, unsupervised tagging clearly helps at this task, as a little loss in precision
is more than compensated with a boost in recall.

[Figure 1: bar chart "Performance for first order CRF, shift 2", untagged vs. tagged.
Recoverable values for city: precision 0.938 vs. 0.906, recall 0.613 vs. 0.678,
F1 0.742 vs. 0.776.]

Fig. 1. Results in precision, recall and F1 for all classes, obtained with first order CRF and a
shift of 2.



5 Conclusion and further work 

In this research we have shown that the use of large, unannotated text can improve 
classification results on small, manually annotated training sets via building a tag- 
ger model with unsupervised tagging and using the unsupervised tags as features in 
the learning algorithm. The benefit of unsupervised tagging is especially significant 
in domain-specific settings, where standard pre-processing steps such as supervised 
tagging do not capture the abstraction granularity necessary for the task, or simply no 
tagger for the target language is available. For further work, we aim at combining the 
possibly several addresses per target location. Given the evaluation values obtained 
with our method, the task of dynamically extracting addresses from web-pages to 
support address search for the tourist domain is feasible and a valuable, dynamic 
add-on to directory-based address search.

Abstract. In this paper we present an unsupervised method to deal with the classification of
out-of-vocabulary words in open-domain spoken dialog systems. This classification is vital to 
ameliorate the human-computer interaction and to be able to extract additional information, 
which can be presented to the user. We propose a two-stage approach for interpreting named 
entities in a document corpus: to cluster documents dealing with a particular named entity 
and to classify it with the help of structural and contextual information in these documents. 
The idea is to take the resulting websites from a search engine queried for a named entity 
as documents and to cluster those which are semantically similar. Named entities can then 
be classified with the information contained in the clusters. Our evaluation showed that the
precision of the classification task was as high as 64.47%. 



1 Introduction 

Open-domain spoken dialog systems need to deal with the classification of out-of- 
vocabulary (OOV) words to be able to give the user the requested information and to 
ameliorate the human-computer interaction. Therefore, an approach is needed which 
semantically classifies those OOV words. For our approach we worked with named 
entities, as these are the class of words which are most likely to be new to the dialog 
system. 

The presented approach combines a clustering of a document corpus with a
method to find hypernyms of named entities in document clusters. For a list of
named entities denominating locations in German cities (e.g. Lotus, Merlin etc.),
the resulting web pages of the Google search engine are cached.

With the help of a document clustering the websites are divided into clusters of 
similar contents. These clusters are then used for an approach of hypernym extraction 
to classify the named entities. An example could be the named entity “Lotus”, which
is not only a restaurant in Heidelberg but also the trademark of a car and a software. 
Our approach would split the resulting website texts into three clusters and classify 
the named entities depending on the textual context.

In the next section the steps for the document clustering task are presented and
in Section 3 the consecutive hypernym extraction is described. 



2 Document clustering 

For a separation of the hypernym candidates of the named entity it is necessary to 
have the documents of the corpus in different groups according to the context. 

We apply the cluster algorithms Clique and the non-hierarchical Single-Link. The
Clique algorithm puts documents into the same cluster if they are pairwise similar
to each other. The result is many small clusters which share some documents.
In the Single-Link algorithm a document only needs some kind of similarity to one
of the documents of the cluster. The result is comparatively few big clusters. (See
Subsection 2.2 for a detailed description of the single steps processed by the
algorithms.)

The evaluation will therefore show into which direction to go with respect to 
clustering approaches in the future. 

2.1 Data preparation 

In the preprocessing step standard term vectors were established using the Porter
Stemmer. The similarity is calculated with the cosine coefficient as shown in
Formula (1).



$$\cos(x, y) = \frac{x \cdot y}{|x| \, |y|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}} \qquad (1)$$



The higher the calculated value, the more similar two documents are to each other.
The similarities between all possible document pairs result in the Document-Document
Relation Matrix (DDRM).

The Document-Document Similarity Matrix (DDSM) is derived from the DDRM and a
threshold value. The result is a boolean matrix, which has entries of one for
similarity values that are higher than or equal to the threshold and zero otherwise.
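The two matrices can be illustrated with a minimal Python sketch. The toy term vectors and the function names are assumptions made for illustration only; stemming and term-vector construction are not shown.

```python
import math

def cosine(x, y):
    """Cosine coefficient of two equal-length term vectors (Formula 1)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0  # an all-zero vector is treated as dissimilar

def ddrm(vectors):
    """Document-Document Relation Matrix: pairwise cosine values."""
    return [[cosine(x, y) for y in vectors] for x in vectors]

def ddsm(relation, threshold):
    """Document-Document Similarity Matrix: 1 where the value reaches the threshold."""
    return [[1 if value >= threshold else 0 for value in row] for row in relation]

# Three toy documents over a four-term vocabulary (hypothetical counts).
docs = [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 3, 1]]
similar = ddsm(ddrm(docs), threshold=0.5)
```

Here the first two documents share terms and end up marked as similar, while the third is similar only to itself.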



2.2 The clustering algorithms 

The clustering algorithms applied with the DDSM in this work are Single-Link and 
Clique. 

The Single-Link algorithm as described by Kowalski (1997) works in four steps:
First, it chooses a document d from the remaining documents and adds it to a new
document cluster. Second, it adds all documents which are similar to d according to
the DDSM to the current cluster. Third, it performs the second step for each document
which was added to the current cluster. Last, once no more documents can be added
to the current cluster, it returns to the first step if unclustered documents
remain; otherwise it terminates.
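The four steps can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' implementation; the function name and the toy DDSM are assumptions.

```python
def single_link(ddsm):
    """Cluster documents from a symmetric 0/1 similarity matrix (DDSM)."""
    remaining = set(range(len(ddsm)))
    clusters = []
    while remaining:
        seed = min(remaining)            # step 1: choose a remaining document
        remaining.remove(seed)
        cluster, frontier = [seed], [seed]
        while frontier:                  # steps 2 and 3: expand from every added document
            d = frontier.pop()
            neighbours = [j for j in sorted(remaining) if ddsm[d][j]]
            for j in neighbours:
                remaining.remove(j)
                cluster.append(j)
                frontier.append(j)
        clusters.append(sorted(cluster)) # step 4: nothing left to add, start a new cluster
    return clusters

# Documents 0-1 and 1-2 are similar, document 3 is isolated (toy data).
ddsm = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
clusters = single_link(ddsm)
```

Documents 0 and 2 land in the same cluster although they are not similar to each other, because both are linked through document 1.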

The Clique clustering algorithm described by Koch (2001) finds document clusters
by creating a seed-list of similar documents starting with an initial document.
As soon as the seed-list contains all similar documents it is declared to be a cluster.
This procedure is carried out for all documents, so that every document finally
belongs to at least one of the created clusters.

These two algorithms were chosen because the resulting clusters are quite different
from each other. Clique produces more and smaller clusters than Single-Link. A
cluster established by Clique always contains pairwise similar documents; hence, all
documents within the cluster are similar to each other. In order to add a document d
to a Single-Link cluster, it is sufficient that d is similar to only one of the
documents belonging to the current cluster.
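The pairwise-similarity requirement of Clique can be illustrated by enumerating maximal cliques of the DDSM graph. The sketch below uses the standard Bron-Kerbosch recursion as a stand-in; it is an assumption that this yields the same kind of clusters as the seed-list procedure of Koch (2001), whose exact steps are not given here.

```python
def maximal_cliques(ddsm):
    """Return all maximal sets of pairwise similar documents (Bron-Kerbosch)."""
    n = len(ddsm)
    adj = {i: {j for j in range(n) if j != i and ddsm[i][j]} for i in range(n)}
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))   # nothing can extend the clique
            return
        for v in sorted(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(range(n)), set())
    return sorted(cliques)

# Same toy DDSM as a chain 0-1-2 plus an isolated document 3.
ddsm = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
cliques = maximal_cliques(ddsm)
```

On this input the result is three small clusters, two of which share document 1, whereas a single-link grouping of the same matrix would merge documents 0, 1 and 2.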



3 Hypernym extraction 

According to Lyons (1977) hyponymy is the relation which holds between a more 
specific lexeme (i.e. a hyponym) and a more general one (i.e. a hypernym). E.g. 
animal is a hypernym of cat. Hypernym Extraction (HE) is applied in cases where 
the hypernym of a given noun or named entity has to be found for example as part of 
an ontology learning framework. 

After the documents of the corpus are divided into different clusters, the HE can
take place separately for each cluster. For this approach a part-of-speech tagger
provides the part-of-speech tags for all terms. The hypernyms of named entities are
generally nouns, and therefore only nouns are considered in the extraction. Three
approaches were considered, resulting in three vectors which are later
consolidated: the frequency of a term in the neighborhood of the named entity; the
distance of a term to the named entity; and the existence of a lexico-syntactic pattern
indicating the hypernym/hyponym relationship as proposed by Hearst (1992).

Hearst used the notion of the hypernym/hyponym relationship pragmatically
when referring to named entities, similar to Miller et al. (1990), who stated that
“a concept represented by a lexical item L_0 is said to be a hyponym of the concept
represented by a lexical item L_1 if native speakers of English accept sentences
constructed from the frame An L_0 is a kind of L_1. Here L_1 is the hypernym of L_0 and
the relationship is reflexive and transitive, but not symmetric.”

No distinction is made between the relationship of nouns and that of named entities
to more general terms. This stands in contrast to the terminology of ontologies,
where this relationship is is-a between classes (corresponding to nouns on the language
level) and instance-of between classes and instances (corresponding
to named entities).

3.1 Term frequency 

For each of the clusters a unique list of nouns occurring in the documents belonging
to the cluster is extracted. This list contains all possible nouns (hypernym candidates)
and therefore serves as a basis to establish the Named-Entity-Term-Vector (NETV).
The NETV is a vector which contains a value for each noun (hypernym candidate)
in the unique list. The value is calculated by the cosine coefficient (as shown in
Formula 1) and signifies the co-occurrence of a hypernym candidate and the named
entity based on term frequency.



3.2 Term distance 



The term distance approach is based on the notion that a smaller distance between
a hypernym candidate and the named entity signifies a more probable hypernym
relation. Hence, smaller distances are weighted more highly.

An example is the following German sentence:

• Das Hotel Auerstein befindet sich verkehrstechnisch günstig im nördlichen
Heidelberger Stadtteil Handschuhsheim. (In English: The Hotel Auerstein is
conveniently located for transport in Handschuhsheim, a northern district of
Heidelberg.)



Therefore, a NETV of dimension p can be built, where p is the number of terms
in the unique list. The entries of the vector are computed from distance
weights as follows: First, a parameter value for the highest possible distance
between a hypernym candidate and the named entity is identified, as shown
in Figure 3 in the Evaluation Section. The results are most promising
for a distance of p = 8.

The average distance weight v_n of the pairwise occurrence of a hypernym n and
the named entity i is calculated according to Formula 2, where w_i is the weight of the
named entity.




( 2 ) 



As the NE occurs more than once in the documents, all occurrences and their
neighborhoods have to be taken into account for the calculation. Therefore, the
average of all distance weights w_{i,n} between any occurrence i and an occurrence of
a hypernym candidate n in the neighborhood of i is calculated, where the neighborhood
is defined by the parameter p. The single distance weights are calculated with Formula 3.



$$w_{i,n} = \begin{cases} \dots & \\ 0, & \text{else} \end{cases} \qquad (3)$$



3.3 Lexico-syntactic patterns 

To take not only statistical methods into account, we also tested
lexico-syntactic patterns according to Hearst (1992). For this purpose we developed a
boolean named-entity-term-vector. Even though lexico-syntactic patterns are detected
infrequently, the probability that a pattern, once found, is correct is high.
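A boolean vector of this kind can be sketched with regular expressions. The three English patterns below ("such as", "and/or other", "including") follow the pattern types of Hearst (1992); the function name, the crude plural handling, and the sample text are assumptions for illustration, and the patterns actually used for the German web pages are not given in the text.

```python
import re

def hearst_vector(text, entity, candidates):
    """Return one 0/1 entry per candidate noun: 1 if a pattern links it to the entity."""
    ent = re.escape(entity)
    vector = []
    for noun in candidates:
        n = re.escape(noun) + r"s?"   # crude plural handling (assumption)
        patterns = [
            rf"\b{n},? such as (?:\w+,? )*?{ent}\b",    # "restaurants such as Merlin, Lotus"
            rf"\b{ent},? (?:and|or) other {n}\b",        # "Lotus and other cars"
            rf"\b{n},? including (?:\w+,? )*?{ent}\b",  # "cars including Lotus"
        ]
        hit = any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)
        vector.append(1 if hit else 0)
    return vector

text = ("We ate at restaurants such as Merlin, Lotus and others. "
        "Lotus and other cars were on display.")
candidates = ["restaurant", "car", "software"]
vector = hearst_vector(text, "Lotus", candidates)
```

In the approach described here such a vector would be computed per document cluster, so that the restaurant and the car readings of the same name are kept apart.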

3.4 Weighting and consolidation



The three described methods for hypernym extraction result in three NETVs of
the same dimension, which are consolidated into one vector. As the probability of
correctness for lexico-syntactic patterns, once found, is high, their weighting
is also high. Nonetheless, the weighting of the other two is taken into account even
if a lexico-syntactic pattern is found.

The following formula serves for the calculation of the consolidated NETV k,
where h is the NETV for the lexico-syntactic patterns, f for the term frequency,
b for the term distance, and w_1, w_2, w_3 are the weights, which are used as parameters
in the evaluation:



$$k = \frac{w_1 \cdot h + w_2 \cdot f + w_3 \cdot b}{3} \qquad (4)$$



According to the entries of the consolidated NETV the most probable hypernym 
candidate can be chosen. 
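As a small worked sketch of Formula 4 in Python: the three input vectors are made-up values, and the weights are taken from the parameter table of the evaluation under the assumption that w_1, w_2, w_3 correspond to hearstWeight, termFrequenceWeight and termDistanceWeight.

```python
def consolidate(h, f, b, w1, w2, w3):
    """Consolidated NETV: entry-wise (w1*h + w2*f + w3*b) / 3 as in Formula 4."""
    return [(w1 * hi + w2 * fi + w3 * bi) / 3 for hi, fi, bi in zip(h, f, b)]

nouns = ["restaurant", "car", "city"]   # hypothetical hypernym candidates
h = [1, 0, 0]                           # lexico-syntactic pattern found? (boolean NETV)
f = [0.40, 0.55, 0.30]                  # term-frequency NETV (made-up values)
b = [0.60, 0.20, 0.10]                  # term-distance NETV (made-up values)

k = consolidate(h, f, b, w1=2, w2=16, w3=11)
best = nouns[max(range(len(k)), key=k.__getitem__)]
```

With these inputs the candidate "restaurant" obtains the highest consolidated value and would be chosen as the hypernym.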



4 Evaluation 

For the evaluation setup we extracted websites from Google for 90 named entities,
which resulted in 90 corpora, each including 10 to 20 documents. For a gold
standard all of these were annotated manually for hypernyms by two annotators.
Furthermore, the annotators marked the corpora which include ambiguous named
entities, which can be used for the evaluation of the document clustering task.
The documents of these corpora were clustered manually into groups of similar documents.

Fig. 1. Percentage of found meanings for Single-Link and Clique

4.1 Evaluation of the clustering task

For the clustering task it appeared that the choice of the clustering algorithm was
important for the results; it was therefore treated as a parameter. Furthermore, the
choice of a good threshold value is important for the establishment of the DDSM.
This parameter is referred to as threshold. For testing it was evaluated over the range
between 0 and 1 in increments of 0.1.

Two metrics were used for the evaluation of the clustering task: the probability
that all different meanings of a named entity were found by a clustering approach
with a specific threshold value, and the recall of automatically correctly clustered
documents. The first one is referred to as average found meanings in the following.

Figure 1 shows the results for average found meanings for the two cluster algorithms
depending on the threshold value, averaged over all named entities. Found meanings
refers to the clusters in which the meaning was contained and
could therefore be found in a later hypernym extraction. For a threshold value of 0.5
Clique outperformed Single-Link considerably, both for average found meanings and
for the recall (as shown in Figure 2). For the recall we calculated how many of the
documents which are assigned to one cluster should actually be there.

For the analysis of an optimal threshold value it is necessary to analyze only
clusterings which consist of clusters that the manual annotation also identifies as clusters.
The precision of the clustering task has to be 100%, as only this can yield reliable
results for the hypernym extraction.

Fig. 2. Recall for Single-Link and Clique

4.2 Evaluation of the Hypernym Extraction task

For the Hypernym Extraction (HE) task the formula for weighting the nouns in the
neighborhood of the NE yields the best results. This parameter is referred to as
neighborRelevance.

The evaluation of the neighborRelevance parameter showed that a window of
eight words surrounding the NE yielded the best results, as shown in Figure 3.
Nonetheless, the analysis of shorter snippets is cheaper, and therefore the
comparatively good results for a value of 4 should be kept in mind for performance
reasons. The formula for the calculations is described in Subsection 3.2.



Fig. 3. Evaluation for neighbor relevance



The precision of the HE task depends on the value of the parameter
amountOfExtractedHypernyms, which refers to the number of hypernyms returned by
the HE module: it was 64.47% for value 1, 77.63% for value 2 and 84.21% for value 3.
The results differ from those of the evaluation for neighbor relevance due to slightly
changed parameter values. Overall, our results outperformed the earlier
methods for hypernym extraction described in Faulhaber et al. (2006) by
about 4% (absolute).

Table 1 shows the results for the best parameter choice according to our evaluation
for a combination of the modules for clustering and HE, which we obtained by
empirical evaluation. These parameter values are not only of interest for
the described approaches but also generally for the tasks of document clustering and
hypernym extraction. The parameter maxWeight is the sum of the three parameters
hearstWeight, termDistanceWeight and termFrequenceWeight.

Table 1. Parameter Value Selection

Parameter              Value
Algorithm              Clique
Threshold              0.5
maxWeight              30
termFrequenceWeight    16
termDistanceWeight     11
hearstWeight           2
neighborRelevance      8



5 Conclusion and future work 

The results show that unsupervised learning is a viable approach for 
context-dependent hypernym extraction. In the future more cluster algorithms are 
to be analyzed and evaluated to obtain a higher recall. 

The goal of our work is to integrate these components into an incremental
ontology learning framework. In case a user asks for a named entity not known to the
system, it should find the appropriate class in the system’s ontology. Therefore, the
found hypernyms are transferred into ontological concepts.

Abstract. In this contribution we focus on the dwell times a user spends on various areas of a web
site within a session. We assume that dwell times may be adequately modeled by a Weibull
distribution, which is a flexible and common approach in survival analysis. Furthermore, we
introduce heterogeneity by various parameterizations of dwell time densities by means of
proportional hazards models. According to these assumptions the observed data stem from
a mixture of Weibull densities. Estimation is based on the EM-algorithm, and model selection
may be guided by BIC. Identification of mixture components corresponds to a segmentation
of users/sessions. A real-life data set stemming from the analysis of a worldwide operating
eCommerce application is provided. The corresponding computations are performed with the
mixPHM package in R.



1 Introduction 

Web Usage Mining focuses on the analysis of the visiting behavior of users on a web
site. A common starting point is the so-called click-stream data, which are derived
from web-server logs and may be viewed as the electronic trace a user leaves on a
web site. Adequate modeling of the dynamics of browsing behavior is of particular
relevance for the optimization of eCommerce applications. Recently Montgomery
et al. (2004) proposed a dynamic multinomial probit model of navigation patterns
which led to a remarkable increase in conversion rates. Park and Fader (2004)
developed multivariate exponential-gamma models which enhance cross-site customer
acquisition. These papers indicate the potential that such approaches offer for
web-shop providers.

In this paper we will focus on modeling dwell times, i.e., the time a user spends
viewing a particular page impression. They are defined by the time span between
two subsequent page requests and can be calculated by taking the difference between
the two logged time points at which the page requests were issued. For the analysis
of complex web sites which consist of a large number of pages it is often reasonable
to reduce the number of different pages by aggregating individual page impressions
into semantically related page categories reflecting meaningful regions of the web site.

Analysis of dwell times is an important source of information with regard to 
the relevance of the content for different users and the effectiveness of the page in 
attracting visitors. In this paper we are particularly interested in segmentation of 
users into various groups which exhibit a similar behavior with regard to the dwell 
times they spend on various areas of the site. Such a segmentation analysis is an 
important step towards a better understanding of the way a user interacts on a web 
site. It is therefore of relevance with regard to the prediction of user behavior as well 
as for a user-specific customization or even personalization of web sites. 



2 Model specification and estimation 

2.1 Weibull mixture model 

Since survival analysis focuses on duration times until some event occurs (e.g. the 
death of a patient in medical applications) it seems straightforward to apply these 
concepts to the analysis of dwell times in web usage mining applications. 

With regard to dwell time distributions we assume that they follow a Weibull
distribution with density function $f(t) = \lambda \gamma t^{\gamma - 1} \exp(-\lambda t^{\gamma})$, where $\lambda$ is a scale
parameter and $\gamma$ the shape parameter. For modeling the heterogeneity of the observed
population, we assume K latent segments of sessions. While the Weibull assumption
holds within all segments, different segments exhibit different parameter values. This
leads to the underlying idea of a Weibull mixture model. For each page category p
(p = 1, ..., P) under consideration the resulting mixture has the following form

$$f(t_p) = \sum_{k=1}^{K} \pi_k f(t_p; \lambda_{k,p}, \gamma_{k,p}) = \sum_{k=1}^{K} \pi_k \lambda_{k,p} \gamma_{k,p} t_p^{\gamma_{k,p} - 1} \exp(-\lambda_{k,p} t_p^{\gamma_{k,p}}) \qquad (1)$$

where $t_p$ represents the dwell time on page category p with mixing proportions $\pi_k$
which correspond to the relative size of each segment.
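The mixture density for a single page category can be evaluated directly. The following Python sketch uses made-up mixing proportions and parameters for K = 2 segments; it only illustrates the formula and is unrelated to the mixPHM implementation.

```python
import math

def weibull_pdf(t, lam, gamma):
    """Weibull density lam * gamma * t^(gamma-1) * exp(-lam * t^gamma) for t > 0."""
    return lam * gamma * t ** (gamma - 1) * math.exp(-lam * t ** gamma)

def mixture_pdf(t, mixing, lams, gammas):
    """Density of a K-component Weibull mixture for one page category."""
    return sum(pi * weibull_pdf(t, lam, g) for pi, lam, g in zip(mixing, lams, gammas))

# Hypothetical values for two latent segments on one page category.
mixing = [0.3, 0.7]   # mixing proportions, sum to one
lams = [0.5, 0.1]     # scale parameters
gammas = [1.0, 2.0]   # shape parameters (gamma = 1 gives the exponential case)

density_at_2 = mixture_pdf(2.0, mixing, lams, gammas)
```

Because the mixing proportions sum to one and each component is a proper density, the mixture integrates to one as well.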

In order to reduce the number of parameters involved we impose restrictions
on the hazard rates of the different mixture components and pages. An
elegant way of doing this is offered by the concept of Weibull proportional hazards
models (WPHM). The general formulation of a WPHM (see e.g., Kalbfleisch and
Prentice (1980)) is

$$h(t; Z) = \lambda \gamma t^{\gamma - 1} \exp(Z\beta), \qquad (2)$$

where Z is a matrix of covariates and $\beta$ are the regression parameters. The term
$\lambda \gamma t^{\gamma - 1}$ is the baseline hazard rate $h_0(t)$ due to the Weibull assumption, and $h(t; Z)$ is the
hazard proportional to $h_0(t)$ resulting from the regression part of the model.
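The proportionality can be checked numerically: for two covariate settings the ratio of the hazards does not depend on t. The sketch below is a plain Python illustration with made-up parameter values and a single covariate.

```python
import math

def wphm_hazard(t, lam, gamma, z, beta):
    """Weibull proportional hazard: baseline lam*gamma*t^(gamma-1) times exp(z . beta)."""
    linear = sum(zj * bj for zj, bj in zip(z, beta))
    return lam * gamma * t ** (gamma - 1) * math.exp(linear)

lam, gamma, beta = 0.2, 1.5, [0.7]   # hypothetical values
ratios = [
    wphm_hazard(t, lam, gamma, [1], beta) / wphm_hazard(t, lam, gamma, [0], beta)
    for t in (0.5, 2.0, 10.0)
]
```

Each ratio equals exp(0.7), i.e. the hazards for the two covariate values are proportional at every time point.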

2.2 Parsimonious modeling strategies

We propose five different models with respect to different proportionality restrictions
on the hazard rates in order to reduce the number of parameters. In the mixPHM package by
Mair and Hudec (2007) the most general model is called separate: the WPHM is
computed for each component and page separately. Hence, the hazard of session i
belonging to component k (k = 1, ..., K) on page category p (p = 1, ..., P) is

$$h(t_{i,p}; 1) = \lambda_{k,p} \gamma_{k,p} t_{i,p}^{\gamma_{k,p} - 1} \exp(\beta 1). \qquad (3)$$

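The parameter matrices can be represented jointly as

$$\Lambda = \begin{pmatrix} \lambda_{1,1} & \cdots & \lambda_{1,P} \\ \vdots & \ddots & \vdots \\ \lambda_{K,1} & \cdots & \lambda_{K,P} \end{pmatrix} \qquad (4)$$

for the scale parameters and

$$\Gamma = \begin{pmatrix} \gamma_{1,1} & \cdots & \gamma_{1,P} \\ \vdots & \ddots & \vdots \\ \gamma_{K,1} & \cdots & \gamma_{K,P} \end{pmatrix} \qquad (5)$$

for the shape parameters. Both the scale and the shape parameters can vary freely
and there is no assumption of hazard proportionality in the separate model. In fact,
the parameters (2 × K × P in total) are the same as if they were estimated directly by
using a Weibull mixture model.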

Next, we impose a proportionality assumption across the latent components. In
the classification version of the EM-algorithm (see next section) each iteration step
yields a “crisp” assignment of each session to a component. Thus, if we consider
this component vector g as main effect in the WPHM, i.e., h(t; g), we impose
proportional hazards for the components across the pages (main.g in mixPHM). Again,
the elements of the matrix $\Lambda$ of scale parameters can vary freely, whereas the shape
parameter matrix reduces to the vector $\Gamma = (\gamma_{1,1}, \ldots, \gamma_{1,P})$. Thus, the shape
parameters are constant over the components and the number of parameters is reduced to
K × P + P.

If we impose page main effects in the WPHM, i.e., h(t; p) or main.p,
respectively, the elements of $\Lambda$ are, as before, not restricted at all, but this time the
shape parameters are constant over the pages, i.e., $\Gamma = (\gamma_{1,1}, \ldots, \gamma_{K,1})$. The total number of
parameters is now K × P + K.

For the main-effects model h(t; g + p) we impose proportionality restrictions on
both $\Lambda$ and $\Gamma$ such that the total number of parameters is reduced to K + P. For
the scale parameter matrix, the proportionality restrictions of this main.gp model hold
row-wise as well as column-wise:

The c- and d-scalars are proportionality constants over the pages and components,
respectively. The shape parameters are constant over the components and pages. Thus,
$\Gamma$ reduces to one shape parameter $\gamma$, which implies that the hazard rates are
proportional over components and pages.

To relax the rather restrictive assumption with respect to $\Lambda$ we can extend the
main-effects model by the corresponding component-page interaction term, i.e.,
h(t; g * p). In mixPHM notation this model is called int.gp. The elements of $\Lambda$ can
vary freely whereas $\Gamma$ is again reduced to one parameter only, leaving us with a total
of K × P + 1 parameters. With respect to the hazard rate this relaxation
again implies proportional hazards over components and pages.
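The parameter counts stated for the five models can be collected in a small Python helper. The function name is hypothetical; the counts are the ones given in the text and cover the Weibull scale and shape parameters only.

```python
def weibull_parameter_count(model, K, P):
    """Number of Weibull scale/shape parameters for K components and P page categories."""
    counts = {
        "separate": 2 * K * P,   # free scale and shape matrices
        "main.g": K * P + P,     # shapes constant over components
        "main.p": K * P + K,     # shapes constant over pages
        "main.gp": K + P,        # main effects only
        "int.gp": K * P + 1,     # free scales, one common shape
    }
    return counts[model]

# Example with K = 5 components and P = 7 page categories.
table = {m: weibull_parameter_count(m, 5, 7)
         for m in ("separate", "main.g", "main.p", "main.gp", "int.gp")}
```

For K = 5 and P = 7 this gives 70, 42, 40, 12 and 36 parameters, which makes the trade-off between flexibility and parsimony explicit.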

2.3 EM-estimation of parameters 

In order to estimate such mixtures of WPHM, we use the EM-algorithm (Dempster
et al. (1977), McLachlan and Krishnan (1997)). In the E-step we establish the
expected likelihood values for each session with respect to the K components. At
this point it is important to take into account the probability that a session i of
component k visits page p, denoted by $Pr_{k,p}$, which is estimated by the corresponding
relative frequency. The elements of the resulting K × P matrix are model parameters
and have to be taken into account when determining the total number of parameters.
The resulting likelihood $L_{k,p}(s_i)$ for session i being in component k, for each page p
individually, is

$$L_{k,p}(s_i) = \begin{cases} f(y_p; \lambda_{k,p}, \gamma_{k,p}) \, Pr_{k,p}(s_i) & \text{if } p \text{ was visited by } s_i \\ 1 - Pr_{k,p}(s_i) & \text{if } p \text{ was not visited by } s_i \end{cases}$$

To establish the joint likelihood, a crucial assumption is made: independence
of the dwell times over page categories. To make this assumption feasible, a
well-advised page categorization must be established. For instance, if some page categories
were hierarchical, the independence assumption would not hold. Without this
independence assumption, a multivariate mixture Weibull model would have to be
fitted which takes into account the covariance structure of the observations. This
would require that each session has a full observation vector of length P, i.e.,
each page category is visited within each session, which does not seem realistic
within the context of dwell times in web usage mining.

Under a reasonable independence assumption, however, the likelihood over all pages
that session i belongs to component k is given by

$$L_k(s_i) = \prod_{p=1}^{P} L_{k,p}(s_i).$$

Thus, by looking at each session i separately, a vector of likelihood values
$\Psi_i = (L_1(s_i), L_2(s_i), \ldots, L_K(s_i))$ results.
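The likelihood vector of one session can be sketched in Python as follows. Non-visited page categories are represented by a dwell time of 0; all parameter values are made up, and the code is an illustration of the formulas rather than the mixPHM code.

```python
import math

def weibull_pdf(t, lam, gamma):
    """Weibull density lam * gamma * t^(gamma-1) * exp(-lam * t^gamma) for t > 0."""
    return lam * gamma * t ** (gamma - 1) * math.exp(-lam * t ** gamma)

def session_likelihoods(times, lam, gamma, pr):
    """Likelihood of one session under each of the K components.

    times[p] is the dwell time on page category p (0 means not visited);
    lam, gamma and pr are K x P nested lists of scale, shape and visit probabilities.
    """
    result = []
    for k in range(len(lam)):
        likelihood = 1.0
        for p, t in enumerate(times):
            if t > 0:   # page visited: density times visit probability
                likelihood *= weibull_pdf(t, lam[k][p], gamma[k][p]) * pr[k][p]
            else:       # page not visited
                likelihood *= 1.0 - pr[k][p]
        result.append(likelihood)   # product over pages (independence assumption)
    return result

# Hypothetical parameters for K = 2 components and P = 2 page categories.
lam = [[0.5, 0.2], [0.1, 0.3]]
gamma = [[1.0, 1.0], [2.0, 1.0]]
pr = [[0.8, 0.3], [0.6, 0.9]]
psi = session_likelihoods([2.0, 0.0], lam, gamma, pr)
```

The session visited only the first page category, so the second category contributes the probability of not being visited.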

At this point, the M-step is carried out. The mixPHM package provides three
different methods. The classical version of the EM-algorithm (maximization EM;
EMoption = "maximization" in mixPHM) computes the posterior probabilities that
session i belongs to group k and does not make a group assignment within each
iteration step but rather updates the matrix of posterior probabilities Q. A faster
EM version, proposed by Celeux and Govaert (1992) and called classification EM
(EMoption = "classification" in mixPHM), performs a group assignment within each
iteration step according to $\sup_k(\Psi_i)$. Hence, the computation of the posterior
matrix is not needed. A randomized version of the M-step combines the two
approaches above: after the computation of the posterior matrix Q, a randomized
group assignment is performed according to the corresponding probability values
(EMoption = "randomization").
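The difference between the three variants can be sketched for a single session in Python. The sketch assumes that the posterior probabilities are obtained in the usual mixture-model way, by weighting the likelihood values with mixing proportions and normalizing; this detail and all names are assumptions, not taken from the mixPHM source.

```python
import random

def posterior(psi, mixing):
    """Row of the posterior matrix Q for one session (assumed: normalized mixing * likelihood)."""
    joint = [pi * l for pi, l in zip(mixing, psi)]
    total = sum(joint)
    return [j / total for j in joint]

def assign_classification(psi):
    """Classification EM: crisp assignment to the component with the largest likelihood."""
    return max(range(len(psi)), key=psi.__getitem__)

def assign_randomized(psi, mixing, rng):
    """Randomized variant: draw the component according to the posterior probabilities."""
    return rng.choices(range(len(psi)), weights=posterior(psi, mixing), k=1)[0]

psi = [0.02, 0.06]       # hypothetical likelihood vector of one session
mixing = [0.5, 0.5]      # hypothetical mixing proportions
q_row = posterior(psi, mixing)
crisp = assign_classification(psi)
drawn = assign_randomized(psi, mixing, random.Random(0))
```

The maximization variant would keep the posterior row as it is, the classification variant keeps only the index of the largest likelihood, and the randomized variant samples an index from the posterior row.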

As usual, the joint likelihood L is updated at each EM-iteration l until a certain
convergence criterion $\varepsilon$ is reached, i.e., $|L^{(l)} - L^{(l-1)}| < \varepsilon$. Theoretical issues about
the EM-convergence in Weibull mixture models can be found in Ishwaran (1996)
and Jewell (1982).



3 Real life example 

In this section we use a real dataset of a large Austrian company which runs a
web-shop to demonstrate our modeling approach. We restrict the empirical analysis to a
subset of 333 buying sessions and 7 page categories, for which we perform a dwell time
based clustering with corresponding hazard proportionality assumptions by using the mixPHM
package in R (R Development Core Team, 2007).

The above extract of the data matrix shows the dwell times of 5 sessions; non-visited
page categories are coded as 0's.

We start with a rather exploratory approach to determine an appropriate
proportionality model with an adequate number of clusters K. By using the msBIC statement
we can accomplish such a heuristic model search.

> res.bic <- msBIC(x, K = 2:5, method = "all")
> res.bic

Bayes Information Criteria
Survival distribution: Weibull

              K=2      K=3      K=4      K=5
separate 23339.27 23202.23 23040.01 22943.11
main.g   23355.66 23058.25 22971.86 22863.43
main.p   23503.73 23368.77 23165.60 23068.47
int.gp   23572.21 23422.51 23305.63 23075.76
main.gp  23642.74 23396.51 23271.72 23087.64
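
Reading off the preferred model from such a table amounts to finding the smallest BIC value. As a small illustration outside of R, the following Python sketch does this for the values printed above; the variable names are arbitrary.

```python
# BIC values copied from the msBIC output above; columns correspond to K = 2, 3, 4, 5.
ks = [2, 3, 4, 5]
bic = {
    "separate": [23339.27, 23202.23, 23040.01, 22943.11],
    "main.g":   [23355.66, 23058.25, 22971.86, 22863.43],
    "main.p":   [23503.73, 23368.77, 23165.60, 23068.47],
    "int.gp":   [23572.21, 23422.51, 23305.63, 23075.76],
    "main.gp":  [23642.74, 23396.51, 23271.72, 23087.64],
}

# Pick the (model, K) combination with the smallest BIC.
best_model, best_k = min(
    ((model, k) for model in bic for k in ks),
    key=lambda mk: bic[mk[0]][ks.index(mk[1])],
)
best_bic = bic[best_model][ks.index(best_k)]
```

The smallest entry is the one for main.g with K = 5.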

It is obvious that the main.g model with K = 5 components fits quite well
compared to the other models (if we fit models for K > 5 the BICs do not decrease
noticeably anymore). To demonstrate the imposed hazard
proportionalities, we compare this model to the more flexible separate model. First, we
fit the two models again by using the phmclust statement, which is the core routine
of the mixPHM package. The matrices of shape parameters $\Gamma_{sep}$ and $\Gamma_g$, respectively,
for the first 5 pages (due to limited space) are:

> res.sep <- phmclust(x, 5, method = "separate")
> res.sep$shape[, 1:5]
           bestview checkout   service figurines jewellery
Component1 3.686052 2.692687 0.8553160 0.9057708 1.2503048
Component2 1.327496 3.393152 1.6260679 0.9716507 0.9941698
Component3 1.678135 2.829635 1.0417360 1.0706117 0.6902553
Component4 1.067241 1.847353 0.9860697 0.9339892 0.6321027
Component5 1.369876 2.030376 1.4565000 0.6434554 1.2414859

> res.g <- phmclust(x, 5, method = "main.g")
> res.g$shape[, 1:5]
           bestview checkout  service figurines jewellery
Component1 1.362342 2.981528 1.116042 0.7935599 0.9145463
Component2 1.362342 2.981528 1.116042 0.7935599 0.9145463
Component3 1.362342 2.981528 1.116042 0.7935599 0.9145463
Component4 1.362342 2.981528 1.116042 0.7935599 0.9145463
Component5 1.362342 2.981528 1.116042 0.7935599 0.9145463

The shape parameters in the latter model are constant across components. As a
consequence, page-wise within-group hazard rates can vary freely for both models,
while the group-wise within-page hazard rates can cross only for the separate model
(see Figure 1).

From Figure 2 it is obvious that the hazards are proportional across components
for each page. Note that due to space limitations, in both plots we only used three
selected pages to demonstrate the hazard characteristics. The hazard plots allow one to
assess the relevance of different page categories with respect to cluster formation.
Similar plots for dwell time distributions are available.



4 Conclusion 

In this work we presented a flexible framework to analyze dwell times on web pages 
by adopting concepts from survival analysis to probability based clustering. Unob- 
served heterogeneity is modeled by mixtures of Weibull distributed dwell times. Ap- 
plication of the EM-algorithm leads to a segmentation of sessions. 

Since the Weibull distribution is rather highly parameterized it offers a sizeable
amount of flexibility for the hazard rates. A more parsimonious modeling may
either be achieved by imposing proportionality restrictions on the hazards or by making
use of simpler distributional assumptions (e.g., for constant hazard rates). The
mixPHM package therefore covers additional survival distributions such as Exponential,
Rayleigh, Gaussian, and Log-logistic.

Fig. 1. Hazard Plot for Model separate

A segmentation of sessions as achieved by our method may serve as a starting
point for the optimization of a website. Identification of typical user behavior allows an
efficient dynamic modification of content as well as an optimization of adverts for
different groups of users.

Fig. 2. Hazard Plot for Model main.g

Abstract. Near-duplicate detection is the task of identifying documents with almost identical
content. The respective algorithms are based on fingerprinting; they have attracted consider- 
able attention due to their practical significance for Web retrieval systems, plagiarism analysis, 
corporate storage maintenance, or social collaboration and interaction in the World Wide Web. 

Our paper presents both an integrative view and new aspects from the field of
near-duplicate detection: (i) Principles and Taxonomy. Identification and discussion of the
principles behind the known algorithms for near-duplicate detection. (ii) Corpus Linguistics.
Presentation of a corpus that is specifically suited for the analysis and evaluation of
near-duplicate detection algorithms. The corpus is public and may serve as a starting point
for a standardized collection in this field. (iii) Analysis and Evaluation. Comparison of
state-of-the-art algorithms for near-duplicate detection with respect to their retrieval
properties. This analysis goes beyond existing surveys and includes recent developments
from the field of hash-based search.



1 Introduction 

In this paper two documents are considered as near-duplicates if they share a very
large part of their vocabulary. Near-duplicates occur in many document collections,
of which the most prominent one is the World Wide Web. Recent studies of Fetterly
et al. (2003) and Broder et al. (2006) show that about 30% of all Web documents
are duplicates of others. Zobel and Bernstein (2006) give examples which
include mirror sites, revisions and versioned documents, or standard text building
blocks such as disclaimers. The negative impact of near-duplicates on Web search
engines is threefold: indexes waste storage space, search result listings can be
cluttered with almost identical entries, and crawlers have a high probability of exploring
pages whose content is already acquired.

Content duplication also happens through text plagiarism, which is the attempt
to present other people’s text as one's own work. Note that in the plagiarism situation
document content is duplicated at the level of short passages; plagiarized passages
can also be modified to a smaller or larger extent in order to obscure the offense.







Aside from deliberate content duplication, copying also happens accidentally:
in companies, universities, or public administrations documents are stored multiple
times, simply because employees are not aware of already existing previous work
(Forman et al. (2005)). A similar situation is given for social software such as
customer review boards or comment boards, where many users publish their opinion
about some topic of interest: users with the same opinion write essentially the same
in diverse ways since they do not read all existing contributions.

A solution to the outlined problems requires a reliable recognition of
near-duplicates, preferably at a high runtime performance. These objectives compete
with each other: a compromise in recognition quality entails deficiencies with
respect to retrieval precision and retrieval recall. A reliable approach to identify two
documents $d$ and $d_q$ as near-duplicates is to represent them under the vector space
model, referred to as $\mathbf{d}$ and $\mathbf{d}_q$, and to measure their similarity
under the $L_2$-norm or the enclosed angle. $d$ and $d_q$ are considered as
near-duplicates if the following condition holds:

$\varphi(\mathbf{d}, \mathbf{d}_q) > 1 - \varepsilon$ with $0 < \varepsilon \ll 1$,

where $\varphi$ denotes a similarity function that maps onto the interval [0, 1]. To achieve
a recall of 1 with this approach, each pair of documents must be analyzed. Likewise,
given $d_q$ and a document collection $D$, the computation of the set $D_q$, $D_q \subset D$,
with all near-duplicates of $d_q$ in $D$, requires $O(|D|)$, that is, linear time in the
collection size. The reason lies in the high dimensionality of the document representation
$\mathbf{d}$, where "high" means "more than 10": objects represented as high-dimensional
vectors cannot be searched efficiently by means of space partitioning methods such as
kd-trees, quad-trees, or R-trees; these methods are outperformed by a sequential scan
(Weber et al. (1998)).

By relaxing the retrieval requirements in terms of precision and recall, the runtime
performance can be significantly improved. The basic idea is to estimate the similarity
between $d$ and $d_q$ by means of fingerprinting. A fingerprint, $F_d$, is a set of $k$ numbers
computed from $d$. If two fingerprints, $F_d$ and $F_{d_q}$, share at least $\kappa$ numbers,
$\kappa < k$, it is assumed that $d$ and $d_q$ are near-duplicates. That is, their similarity is
estimated using the Jaccard coefficient:

$\frac{|F_d \cap F_{d_q}|}{|F_d \cup F_{d_q}|} \geq \frac{\kappa}{k} \;\Rightarrow\; P(\varphi(\mathbf{d}, \mathbf{d}_q) > 1 - \varepsilon)$ is close to 1

Let $F_D = \bigcup_{d \in D} F_d$ denote the union of the fingerprints of all documents in $D$, let
$\mathcal{D}$ be the power set of $D$, and let $\mu : F_D \to \mathcal{D}$, $x \mapsto \mu(x)$, be an
inverted file index that maps a number $x \in F_D$ onto the set of documents whose fingerprints
contain $x$; $\mu(x)$ is also called the postlist of $x$. For a document $d_q$ with fingerprint
$F_{d_q}$ consider now the set $\hat{D}_q \subset D$ of documents that occur in at least $\kappa$ of
the postlists $\mu(x)$, $x \in F_{d_q}$. Put another way, $\hat{D}_q$ consists of documents whose
fingerprints share at least $\kappa$ numbers with $F_{d_q}$. We use $\hat{D}_q$ as a heuristic
approximation of $D_q$; the retrieval performance, which depends on the finesse of the
fingerprint construction, is computed as follows:

$\mathit{prec} = \frac{|\hat{D}_q \cap D_q|}{|\hat{D}_q|}, \qquad \mathit{rec} = \frac{|\hat{D}_q \cap D_q|}{|D_q|}$
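The retrieval scheme described above can be illustrated with a minimal Python sketch. It assumes that fingerprints are already given as sets of integers; the function names, the document identifiers, and the parameter name `kappa` (the minimum number of shared fingerprint numbers) are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict


def build_index(fingerprints):
    """Inverted file index: map each number x to its postlist (a set of document ids)."""
    index = defaultdict(set)
    for doc_id, fingerprint in fingerprints.items():
        for x in fingerprint:
            index[x].add(doc_id)
    return index


def candidates(index, query_fingerprint, kappa):
    """Return the documents that occur in at least kappa postlists of the query's numbers."""
    hits = Counter()
    for x in set(query_fingerprint):
        for doc_id in index.get(x, ()):
            hits[doc_id] += 1
    return {doc_id for doc_id, count in hits.items() if count >= kappa}


def precision_recall(retrieved, relevant):
    """Standard precision and recall of a retrieved set against the true near-duplicates."""
    common = len(retrieved & relevant)
    prec = common / len(retrieved) if retrieved else 0.0
    rec = common / len(relevant) if relevant else 0.0
    return prec, rec


# Toy collection: made-up fingerprints of size k = 4.
fingerprints = {"d1": {1, 2, 3, 4}, "d2": {1, 2, 3, 9}, "d3": {7, 8, 9, 10}}
index = build_index(fingerprints)
found = candidates(index, {1, 2, 3, 5}, kappa=3)
```

In this toy collection the query shares three numbers with both `d1` and `d2`, so both are returned; if only `d1` were a true near-duplicate, precision would be 0.5 and recall 1.0.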
Fig. 1. Taxonomy of fingerprint construction methods (left) and algorithms (right).



The remainder of the paper is organized as follows. Section 2 gives an overview 
of fingerprint construction methods and classifies them in a taxonomy, including so 
far unconsidered hashing technologies. In particular, different aspects of fingerprint 
construction are contrasted and a comprehensive view on their retrieval properties 
is presented. Section 3 deals with evaluation methodologies for near-duplicate de- 
tection and proposes a new benchmark corpus of realistic size. The state-of-the-art 
fingerprint construction methods are subject to an experimental analysis using this 
corpus, providing new insights into precision and recall performance. 



2 Fingerprint construction 

A chunk or an n-gram of a document $d$ is a sequence of $n$ consecutive words found
in $d$.¹ Let $C_d$ be the set of all different chunks of $d$. Note that $C_d$ is at most of size
$|d| - n$ and can be assessed with $O(|d|)$. Let $\mathbf{d}$ be a vector space representation of $d$
where each $c \in C_d$ is used as descriptor of a dimension with a non-zero weight.

According to Stein (2007) the construction of a fingerprint from d can be under- 
stood as a three-step-procedure, consisting of dimensionality reduction, quantization, 
and encoding: 

1. Dimensionality reduction is realized by projecting or by embedding. Algorithms 
of the former type select dimensions in d whose values occur unmodified in 
the reduced vector d'. Algorithms of the latter type reformulate d as a whole, 
maintaining as much information as possible. 

2. Quantization is the mapping of the elements in d' onto small integer numbers, 
obtaining d". 

3. Encoding is the computing of one or several codes from d", which together form 
the fingerprint of d. 

Fingerprint algorithms differ primarily in the employed dimensionality reduction 
method. Figure 1 organizes the methods along with the known construction algo- 
rithms; the next two subsections provide a short characterization of both. 

¹ If the hashed breakpoint chunking strategy of Brin et al. (1995) is applied, n can be
understood as the expected value of the chunk length.
Table 1. Summary of chunk selection heuristics. The rows contain the name of the construction
algorithm along with typical constraints that must be fulfilled by the selection heuristic $\sigma$.

Algorithm (Author)                                    Selection heuristic $\sigma(c)$

rare chunks (Heintze (1996))                          c occurs once in D
SPEX (Bernstein and Zobel (2004))                     c occurs at least twice in D
I-Match (Chowdhury et al. (2002),                     c = d; excluding non-discriminant terms of d
  Conrad et al. (2003), Kolcz et al. (2004))
shingling (Broder (2000))                             $c \in \{c_1, \ldots, c_k\}$, $\{c_1, \ldots, c_k\} \subset_{rand} C_d$
prefix anchor (Manber (1994))                         c starts with a particular prefix, or
              (Heintze (1996))                        c starts with a prefix which is infrequent in d
hashed breakpoints (Manber (1994))                    h(c)'s last byte is 0, or
                   (Brin et al. (1995))               c's last word's hash value is 0
winnowing (Schleimer et al. (2003))                   c minimizes h(c) in a window sliding over d
random (misc.)                                        c is part of a local random choice from $C_d$
one of a sliding window (misc.)                       c starts at word i mod m in d; 1 < m < |d|
super- / megashingling                                c is a combination of hashed chunks
  (Broder (2000) / Fetterly et al. (2003))            which have been selected with shingling



2.1 Dimensionality reduction by projecting 

If dimensionality reduction is done by projecting, a fingerprint for document $d$
can be formally defined as follows:

$F_d = \{ h(c) \mid c \in C_d \text{ and } \sigma(c) = \mathit{true} \}$,

where $\sigma$ denotes a selection heuristic for dimensionality reduction that becomes true
if a chunk fulfills a certain property, and $h$ denotes a hash function, such as MD5 or
Rabin's hash function, which maps chunks to natural numbers and serves as a means for
quantization. Usually the identity mapping is applied as encoding rule. Broder (2000)
describes a more intricate encoding rule called supershingling.

The objective of $\sigma$ is to select those chunks for a fingerprint which are best
suited for a reliable near-duplicate identification. Table 1 presents in a consistent way
the algorithms and the implemented selection heuristics found in the literature, where
each heuristic is of one of the types denoted in Figure 1.
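The projecting-based definition can be sketched in a few lines of Python. This is an illustration under stated assumptions, not one of the algorithms from Table 1: chunks are word n-grams, the hash function takes the first 8 bytes of an MD5 digest, and the selection heuristic `sigma` keeps a chunk when its hash value is divisible by a hypothetical parameter `p`.

```python
import hashlib


def chunks(text, n=3):
    """Return the set of all different word n-grams (chunks) of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def h(chunk):
    """Map a chunk to a natural number (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(chunk.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sigma(chunk, p=4):
    """Hypothetical selection heuristic: true if the chunk's hash is divisible by p."""
    return h(chunk) % p == 0


def fingerprint(text, n=3, p=4):
    """F_d = { h(c) | c in C_d and sigma(c) is true }, with the identity as encoding."""
    return {h(c) for c in chunks(text, n) if sigma(c, p)}


text = "the quick brown fox jumps over the lazy dog"
all_hashes = fingerprint(text, p=1)   # p = 1 selects every chunk
selected = fingerprint(text, p=4)     # roughly a quarter of the chunks on average
```

Because the selection depends only on a chunk's content, two documents that share a chunk make the same decision about it, which is what allows fingerprints of different documents to overlap.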

2.2 Dimensionality reduction by embedding 

An embedding-based fingerprint for a document $d$ is typically constructed with a
technique called "similarity hashing" (Indyk and Motwani (1998)). Unlike standard
hash functions, which aim at minimizing the number of hash collisions, a
similarity hash function $h_\varphi : D \to U$, $U \subset \mathbb{N}$, shall produce a collision
with a high probability for two objects $d, d_q \in D$, iff $\varphi(\mathbf{d}, \mathbf{d}_q) > 1 - \varepsilon$.
In this way $h_\varphi$ downgrades a fine-grained similarity relation quantified within $\varphi$
to the concept "similar or not similar", reflected by the fact whether or not the
hashcodes $h_\varphi(\mathbf{d})$ and $h_\varphi(\mathbf{d}_q)$ are identical. To construct a
fingerprint $F_d$ for document $d$, a small number $k$ of variants of $h_\varphi$ are used:

$F_d = \{ h_\varphi^{(i)}(\mathbf{d}) \mid i \in \{1, \ldots, k\} \}$

Table 2. Summary of complexities for the construction of a fingerprint, the retrieval, and the
size of a tailored chunk index.

Algorithm                   Runtime         Runtime      Chunk       Fingerprint  Chunk
                            (construction)  (retrieval)  length      size         index size

rare chunks                 O(|d|)          O(|d|)       n           O(|d|)       O(|d|·|D|)
SPEX (0 < r ≪ 1)            O(|d|)          O(r·|d|)     n           O(r·|d|)     O(r·|d|·|D|)
I-Match                     O(|d|)          O(k)         |d|         O(k)         O(k·|D|)
shingling                   O(|d|)          O(k)         n           O(k)         O(k·|D|)
prefix anchor               O(|d|)          O(|d|)       n           O(|d|)       O(|d|·|D|)
hashed breakpoints          O(|d|)          O(|d|)       E(|c|) = n  O(|d|)       O(|d|·|D|)
winnowing                   O(|d|)          O(|d|)       n           O(|d|)       O(|d|·|D|)
random                      O(|d|)          O(k)         n           O(k)         O(|d|·|D|)
one of sliding window       O(|d|)          O(|d|)       n           O(|d|)       O(|d|·|D|)
super- / megashingling      O(|d|)          O(k)         n           O(k)         O(k·|D|)
fuzzy-fingerprinting        O(|d|)          O(k)         |d|         O(k)         O(k·|D|)
locality-sensitive hashing  O(|d|)          O(k)         |d|         O(k)         O(k·|D|)

Two kinds of similarity hash functions have been proposed: those that compute
hashcodes based on knowledge about the domain, and those that are grounded in
domain-independent randomization techniques (see again Figure 1). Both kinds of
similarity hash functions compute hashcodes along the three steps outlined above. An
example for the former is fuzzy-fingerprinting developed by Stein (2005), where the
embedding step relies on a tailored, low-dimensional document model and where
fuzzification is applied as a means for quantization. An example for the latter is
locality-sensitive hashing and the variants thereof by Charikar (2002) and Datar et al.
(2004). Here the embedding relies on the computation of scalar products of $\mathbf{d}$ with
random vectors, and the scalar products are mapped onto predefined intervals on the real
number line as a means for quantization. In both approaches the encoding happens
according to a summation rule.
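The randomized variant can be sketched as follows. The code loosely follows the description above (scalar products with random vectors, quantization by interval index, encoding by summation); the interval width `w`, the number of random vectors `m`, the use of seeds to obtain the $k$ variants, and the pairing of each code with its variant index are all assumptions made for this illustration and are not taken from the cited papers.

```python
import math
import random


def make_hash(dim, w=4.0, m=8, seed=0):
    """Build one similarity hash function from m random vectors (hypothetical parameters)."""
    rng = random.Random(seed)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]
    offsets = [rng.uniform(0.0, w) for _ in range(m)]

    def hash_vector(d):
        code = 0
        for a, b in zip(vectors, offsets):
            dot = sum(x * y for x, y in zip(a, d))   # embedding: scalar product
            code += math.floor((dot + b) / w)        # quantization: interval index
        return code                                  # encoding: summation rule

    return hash_vector


def fingerprint(d, k=3, **params):
    """Fingerprint of k hashcodes, each tagged with the index of its hash variant."""
    return {(i, make_hash(len(d), seed=i, **params)(d)) for i in range(k)}


vector = [1.0, 0.5, 0.0, 2.0]
fp = fingerprint(vector, k=3)
```

Vectors that lie close together tend to fall into the same intervals and therefore tend to receive the same codes, but a collision is only probable, not guaranteed; the choice of `w` and `m` trades precision against recall.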

2.3 Discussion 

We have analyzed the aforementioned fingerprint construction methods with respect
to construction time, retrieval time, and the resulting size of a complete chunk index.
Table 2 compiles the results.

The construction of a fingerprint for a document $d$ depends on its length since
$d$ has to be parsed at least once, which explains why all methods have the same
complexity in this respect. The retrieval of near-duplicates requires a chunk index
$\mu$ as described at the outset: $\mu$ is queried with each number of a query document's
fingerprint $F_{d_q}$, and the obtained postlists are merged. We assume that both
the lookup time and the average length of a postlist can be bounded by a constant
for either method.² Thus the retrieval runtime depends only on the size $k$ of
a fingerprint. Observe that the construction methods fall into two groups: methods
whose fingerprint size increases with the length of a document, and methods where
$k$ is independent of $|d|$. The size of $\mu$ is affected similarly. We further differentiate
methods with fixed-length fingerprints into those which construct small fingerprints
where $k < 10$ and those where $10 \leq k < 500$. Small fingerprints are constructed by
fuzzy-fingerprinting, locality-sensitive hashing, supershingling, and I-Match; these
methods outperform the others by orders of magnitude in their chunk index size.



3 Wikipedia as evaluation corpus 

When evaluating near-duplicate detection methods one faces the problem of choosing
a corpus which is representative for the retrieval situation and which provides a
realistic basis to measure both retrieval precision and retrieval recall. Today's standard
corpora such as the TREC or Reuters collection have deficiencies in this connection:
in standard corpora the distribution of similarities decreases exponentially
from a very high percentage at low similarity intervals to a very low percentage at
high similarity intervals. Figure 2 (right) illustrates this characteristic for the Reuters
corpus. This characteristic allows only precision evaluations since the recall performance
depends on very few pairs of documents. The corpora employed in recent
evaluations of Hoad and Zobel (2003), Henzinger (2006), and Ye et al. (2006) fall short
in this respect; moreover, they are custom-built and not publicly available. Conrad
and Schriber (2004) attempt to overcome this issue by the artificial construction of a
suitable corpus.



Wikipedia corpus:

Property               Value

documents              6 million
revisions              80 million
size (uncompressed)    1 terabyte

Fig. 2. The table (left) shows orders of magnitude of the Wikipedia corpus. The plot contrasts
the similarity distribution within the Reuters Corpus Volume 1 and the Wikipedia corpus.



² We indexed all English Wikipedia articles and found that an increase from 3 to 4 in the
chunk length implies a decrease from 2.42 to 1.42 in the average postlist length.
Fig. 3. Precision and recall over similarity for fuzzy-fingerprinting (FF), locality-sensitive
hashing (LSH), supershingling (SSh), shingling (Sh), and hashed breakpoint chunking (HBC).



We propose to use the Wikipedia Revision Corpus, which includes all revisions of every
Wikipedia article, for near-duplicate detection.³ The table in Figure 2 shows selected
orders of magnitude of the corpus. A preliminary analysis shows that an article's
revisions are often very similar to each other, with an expected similarity of about 0.5
to the first revision. Since the articles of Wikipedia undergo regular rephrasing,
the corpus addresses the particularities of the use cases mentioned at the outset. We
analyzed the fingerprinting algorithms with 7 million pairs of documents, using the
following strategy: each article's first revision serves as query document $d_q$ and is
compared to all other revisions as well as to the first revision of its immediate
successor article. The former ensures a large number of near-duplicates and hence
improves the reliability of the recall values; the rationale of the latter is to gather
sufficient data to evaluate the precision (cf. Figure 2, right-hand side).

Figure 3 presents the results of our experiments in the form of precision-over- 
similarity curves (left) and recall-over-similarity curves (right). The curves are com- 
puted as follows: For a number of similarity thresholds from the interval [0; 1] the 
set of document pairs whose similarity is above a certain threshold is determined. 
Each such set is compared to the set of near-duplicates identified by a particular fin- 
gerprinting method. From the intersection of these sets then the threshold-specific 
precision and recall values are computed in the standard way. 
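The curve computation just described can be sketched as follows. The data layout is an assumption for illustration: `pairs` maps a hypothetical pair identifier to the pair's similarity, and `detected` is the set of pairs that a fingerprinting method reported as near-duplicates. Whether the threshold comparison is strict is not stated in the text; the sketch uses "greater than or equal".

```python
def curves(pairs, detected, thresholds):
    """Return (threshold, precision, recall) triples for a fingerprinting method.

    pairs:      dict mapping a pair id to the similarity of the two documents
    detected:   set of pair ids flagged as near-duplicates by the method
    thresholds: iterable of similarity thresholds from [0, 1]
    """
    result = []
    for t in thresholds:
        above = {p for p, similarity in pairs.items() if similarity >= t}
        common = len(above & detected)
        prec = common / len(detected) if detected else 0.0
        rec = common / len(above) if above else 0.0
        result.append((t, prec, rec))
    return result


# Made-up example: three document pairs, two of them flagged by the method.
pairs = {"a": 0.9, "b": 0.6, "c": 0.2}
detected = {"a", "b"}
points = curves(pairs, detected, [0.5, 0.8])
```

At threshold 0.5 both flagged pairs count as true near-duplicates, so precision and recall are both 1.0; at threshold 0.8 only pair `a` qualifies, which lowers precision to 0.5 while recall stays at 1.0.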

As can be seen in the plots, the chunking-based methods perform better than
similarity hashing, with hashed breakpoint chunking performing best. Of those with
fixed-size fingerprints shingling performs best, and of those with small fixed-size
fingerprints fuzzy-fingerprinting and supershingling perform similarly. Note that the
latter two both had fingerprints 50 times smaller than shingling, which shows the possible
impact of these methods on the size of a chunk index.



³ http://en.wikipedia.org/wiki/Wikipedia:Download, last visit on February 27, 2008
4 Summary

Algorithms for near-duplicate detection are applied in retrieval situations such as
Web mining, plagiarism detection, corporate storage maintenance, and social software.
In this paper we developed an integrative view of existing and new technologies
for near-duplicate detection. Theoretical considerations and practical evaluations
show that shingling, supershingling, and fuzzy-fingerprinting perform best in
terms of retrieval recall, retrieval precision, and chunk index size. Moreover, a new,
publicly available corpus is proposed, which overcomes weaknesses of the standard
corpora when analyzing use cases from the field of near-duplicate detection.
Abstract. The basis for most classification algorithms dealing with word sense induction and
word sense disambiguation is the assumption that certain context words are typical of a
particular sense of an ambiguous word. However, as such algorithms have been only moderately
successful in the past, the question that we raise here is whether this assumption really holds.
Starting with an inventory of predefined senses and sense descriptors taken from the University
of South Florida Homograph Norms, we present a quantitative study of the distribution of
these descriptors in a large corpus. Our focus is on the comparison of co-occurrence
frequencies between descriptors belonging to the same versus to different senses, and on the
effects of considering groups of descriptors rather than single descriptors. Our findings are that
descriptors belonging to the same sense co-occur significantly more often than descriptors
belonging to different senses, and that considering groups of descriptors effectively reduces the
otherwise serious problem of data sparseness.



1 Introduction 

Resolving semantic ambiguities of words is among the core problems in natural lan- 
guage processing. Many applications, such as text understanding, question answer- 
ing, machine translation, and speech recognition suffer from the fact that - despite 
numerous attempts (e.g. Kilgarriff and Palmer, 2000; Pantel and Lin, 2002; Rapp, 
2004) - there is still no satisfactory solution to this problem. Although it seems rea- 
sonable that the statistical approach is the method of choice, it is not obvious what 
statistical clues should be looked at, and how to deal with the omnipresent problem 
of data sparseness. 

In this situation, rather than developing another algorithm and adding it to the 
many that already exist, we found it more appropriate to systematically look at the 
empirical foundations of statistical word sense induction and disambiguation (Rapp, 
2006). The basic assumption underlying most if not all corpus-based algorithms is 
the observation that each sense of an ambiguous word seems to be associated with 
certain context words. These context words can be considered to be indicators of this 
particular sense. For example, context words such as grow and soil are typical of the 
flora meaning of plant, whereas power and manufacture are typical of its industrial 
meaning. Being associated implies that the indicators should co-occur significantly
more often than expected by chance in the local contexts of the respective ambiguous 
word. Looking only at local contexts can be justified by the observation that for 
humans in almost all cases the local context suffices to achieve an almost perfect 
disambiguation performance, which implies that the local context carries all essential 
information. 

If there exist several indicators of the same sense, then it cannot be ruled out,
but it is unlikely, that they are mutually exclusive. As a consequence, in
the local contexts of an ambiguous word indicators of the same sense should have
co-occurrence frequencies that are significantly higher than chance, whereas for
indicators relating to different senses this should not be the case or, if so, only to a
lesser extent.

Our aim in this study is to quantify this effect by generating statistics on the co- 
occurrence frequencies of sense indicators in a large corpus. Hereby, our inventory 
of ambiguous words, their senses, and their sense indicators is taken from the Uni- 
versity of South Florida Homograph Norms (USFHN), and the co-occurrence counts 
are taken from the British National Corpus (BNC). As previous work (Rapp, 2006) 
showed that the problem of data sparseness is severe, we also propose a methodology 
for effectively dealing with it. 



2 Resources 

For the purpose of our study a list of ambiguous words is required together with 
their senses and some typical indicators of each sense. As described in Rapp (2006), 
such data was extracted from the USFHN. These norms were compiled by collecting 
the associative responses given by test persons to a list of 320 homographs, and 
by manually assigning each response to one of the homograph’s meanings. Further 
details are given in Nelson et al. (1980). 

For the current study, we extracted from this data a list of all 134 homographs
where each comes together with five associated words that are typical of its first
sense, and another five words that are typical of its second sense. The first ten
entries in this list are shown in Table 1. Note that for reasons to be discussed later we
abandoned all homographs where either the first or the second sense did not receive
at least five different responses. This was the case for 186 homographs, which is the
reason that our list comprises only 134 of the 320 items. As in the norms the
homographs were written in uppercase letters only, we converted them to that spelling
of uppercase and lowercase letters that was found to have the highest occurrence
frequency in the BNC. In the few cases where subjects had responded with multi-word
units, these were disregarded unless one of the words carried almost all of the
meaning.

Another resource that we use is the BNC, which is a balanced sample of written 
and spoken English that comprises about 100 million words. As described in Rapp 
(2006), this corpus was used without special pre-processing, and for each of the 
134 homographs concordances were extracted comprising text windows of particular
widths (e.g. ±10 words around the given word). 



Table 1. Some homographs and the top five associations for their two main senses.

HOMOG.  SENSE 1                             SENSE 2

bar     drink beer tavern stool booze       crow bell handle gold press
beam    wood ceiling house wooden building  light laser sun joy radiate
bill    pay money payment paid me           John Uncle guy name person
block   stop tackle road buster shot        wood ice head cement substance
bluff   fool fake lie call game             cliff mountain lovely high ocean
board   wood plank wooden ship nails        chalk black bill game blackboard
bolt    nut door lock screw close           jump run leap upright colt
bound   tied tie gagged rope chained        jump leap bounce up over
bowl    cereal dish soup salad spoon        ball pins game dollars sport
break   ruin broken fix tear repair         out even away jail fast



3 Approach 

For each homograph we were interested in three types of information. One is the
average intra-sense association strength for sense 1, i.e. the average association strength
between all possible pairs of words belonging to sense 1. Another is the average
intra-sense association strength for sense 2, which is calculated analogously. And a third is
the average inter-sense association strength between senses 1 and 2, i.e. the average
association strength between all possible pairs of words under the condition that the
two words in each pair must belong to different senses. Using the homograph bar,
Figure 1 illustrates this by making explicit all pairs of associations that are involved
in the computation of the average strengths. Hereby the association strength $a_{ij}$
between two words $i$ and $j$ is computed as the number of lines in the concordance where
both words co-occur ($f_{ij}$) divided by the product of the concordance frequencies $f_i$
and $f_j$ of the two words:

$a_{ij} = \frac{f_{ij}}{f_i \cdot f_j}$

This formula normalizes for word frequencies and thereby avoids undesired ef- 
fects resulting from their tremendous variation. In cases where the denominator was 
zero we assigned a score of zero to the whole expression. Note that the counts in the 
denominator are observed word frequencies within the concordance, not within the 
entire corpus. 
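As an illustration, the association strength can be computed as in the following Python sketch. It assumes that each concordance line is available as a set of word tokens and that the concordance frequency of a word is the number of lines containing it; both the data layout and the function name are assumptions made for this example.

```python
def association_strength(concordance, word_i, word_j):
    """a_ij = f_ij / (f_i * f_j), with a score of zero when the denominator is zero.

    concordance: list of sets, one set of word tokens per concordance line
    """
    f_i = sum(1 for line in concordance if word_i in line)
    f_j = sum(1 for line in concordance if word_j in line)
    f_ij = sum(1 for line in concordance if word_i in line and word_j in line)
    denominator = f_i * f_j
    return f_ij / denominator if denominator else 0.0


# Made-up concordance lines for the homograph "bar".
concordance = [
    {"drink", "beer"},
    {"drink"},
    {"beer", "stool"},
    {"gold"},
]
a_drink_beer = association_strength(concordance, "drink", "beer")
a_drink_tavern = association_strength(concordance, "drink", "tavern")
```

Here "drink" and "beer" each occur in two lines and co-occur in one, giving 1 / (2 · 2) = 0.25, while "tavern" never occurs, so the zero-denominator rule yields a score of zero.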

Whereas the above formula computes association strengths for single word pairs, 
what we are actually interested in are the three types of average association strengths 
aij as depicted in Figure 1 . For ease of reference, in the remainder of the paper we 
use the following notation: 
[Figure: 2 x 10 intra-sense associations within sense 1 and within sense 2; 25 inter-sense
associations between sense 1 and sense 2]

Fig. 1. Computation of average intra-sense and inter-sense association strengths exemplified
using the associations to the homograph bar.



S1 = average $a_{ij}$ over the 10 word pairs relating to sense 1

S2 = average $a_{ij}$ over the 10 word pairs relating to sense 2

SS = average $a_{ij}$ over the 25 word pairs relating to senses 1 and 2

The reasoning behind computing average scores is to minimize the problem of 
data sparseness by taking many observations into account. An important feature of 
our setting is that if we - as described in the next section - increase the number 
of words that we consider (5 per sense in the example of Figure 1), the number of 
possible pairs increases quadratically, which means that this should be an effective 
measure for solving the sparse-data problem. Note that we could not sensibly go be- 
yond five words in this study, as the number of associations provided in the USFHN 
is rather limited for each sense, so that there is only a small number of homographs 
where more than five associations are provided for the two main senses. 
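The three averages and the growth in the number of word pairs can be sketched as follows. The function `strength` stands for any pairwise association measure and is passed in as a parameter; all names are hypothetical and serve only to illustrate the computation.

```python
from itertools import combinations, product


def average(values):
    """Arithmetic mean of an iterable, or zero if it is empty."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def sense_scores(strength, sense1, sense2):
    """Return (S1, S2, SS) for two lists of sense indicators."""
    s1 = average(strength(a, b) for a, b in combinations(sense1, 2))
    s2 = average(strength(a, b) for a, b in combinations(sense2, 2))
    ss = average(strength(a, b) for a, b in product(sense1, sense2))
    return s1, s2, ss


def pair_counts(k):
    """Number of intra-sense pairs per sense and of inter-sense pairs for k words per sense."""
    return k * (k - 1) // 2, k * k


# Toy association measure: words are associated iff they start with the same letter.
def toy_strength(a, b):
    return 1.0 if a[0] == b[0] else 0.0


scores = sense_scores(toy_strength, ["a1", "a2", "a3"], ["b1", "b2"])
```

For five words per sense, `pair_counts(5)` gives 10 intra-sense pairs per sense and 25 inter-sense pairs, matching the counts in Figure 1; for two words per sense only 1 and 4 pairs remain, which illustrates why fewer associations leave the averages more exposed to sparse data.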

When comparing the scores S1, S2, and SS, what should be our expectations?
Most importantly, as discussed in the introduction, same-sense co-occurrences should
be more frequent than different-sense co-occurrences, thus both S1 and S2 should be
larger than SS. But should S1 and S2 be at the same level or not? To answer this
question, recall that S1 relates to the main sense of the respective homograph, and
S2 to its secondary sense. If both senses are similarly frequent, then both scores are
based on equally good data and can be expected to be at similar levels. However, if
the vast majority of cases relates to the main sense, and if the secondary sense occurs
only a few times in the corpus (an example being the word can with its frequent verb
and infrequent noun sense), then the co-occurrence counts - which are always based
on the entire concordance - would mainly reflect the behavior of the main sense, and
might be only marginally influenced by the secondary sense. As will be shown in
the next section, for our data S2 turns out to be at about the same level as S1. Note,
however, that this could be an artefact of our choice of homographs, as we had to
pick those where the subjects provided at least five different associative responses to
each sense.



4 Results and discussion 

Following the procedure described in the previous section, Table 2 shows the first
10 out of 134 results for the homograph-based concordances of width ±100 with
five associations being considered for each sense (for a list of these associations see
Table 1). In all but two cases the values for S1 and S2 are - as expected - both larger
than those for SS. Only the homographs bill and break behave unexpectedly. In the
first case this may be explained by our way of dealing with capitalization (see section
2); in the second it is probably due to continuation associations such as break - out
and break - away, which are not specifically dealt with in our system.

Note that the above qualitative considerations are only meant to give an impression
of some underlying subtleties which make it unrealistic to expect an overall
accuracy of 100%. Nevertheless, exact quantitative results are given in Table 3. For
several concordance widths, this table shows the number of homographs where the
results turn out to be as expected, i.e. S1 > SS and S2 > SS (columns 3 and 4). Perfect
results would be indicated by values of 134 (i.e. the total number of homographs) in
each of these two columns. The table also shows the number of cases where S1 > S2,
and - as additional information - the number of cases where S1 and SS, S2 and
SS, and S1 and S2 are equal. As S1, S2, and SS are averages over several floating
point values, their equality is very unlikely except for the case when all underlying
co-occurrence scores are zero, which is only true if data is very sparse. Thus the
equality scores can be seen as a measure of data sparseness. As data sparseness
affects all three equality scores likewise, they can be expected to be at similar levels.
Nevertheless, to confirm this expectation empirically, scores are shown for all three.



Table 2. Results for the first ten homographs (numbers to be multiplied by 10 ®). 



Homograph    S1     S2     SS      Homograph    S1     S2     SS

bar          223    199    37      board        205    799    53
beam         1305   1424   123     bolt         1794   3747   962
bill         166    95     202     bound        675    692    139
block        194    945    112     bowl         327    644    25
bluff        934    2778   226     break        156    63     95


In Table 3, for each concordance width we also distinguish four cases where
each relates to a different number of associations (or sense indicators) considered.
Whereas so far we always assumed that for each homograph we take five associations
into account that relate to its first, and another five associations that relate to
its second sense (as depicted in Figure 1), it is of course also possible to reduce
the number of associations considered to four, three, or two. (A reduction to one is
not possible as in this case the intra-sense association strengths S1 and S2 are not
defined.) This is what we did to obtain comparative values that enable us to judge
whether an increase in the number of associations considered actually leads to the
significant gains in accuracy that can be expected if our analysis from the previous
section is correct.

Having described the meaning of the columns in Table 3, let us now look at
the actual results. As mentioned above, the last three columns of the table give us
information on data sparsity. For the concordance width of ±1 word their values
are fairly close to 134, which means that most co-occurrence frequencies are zero
with the consequence that the results of columns 3 to 5 are not very informative.
Of course, when looking at language use, this result is not unexpected, as
the direct neighbors of content words are often function words, so that adjacent
co-occurrences involving other content words are rare.

If we continue to look at the last three columns of Table 3, but now consider 
larger concordance widths, we see that the problem of data sparseness steadily de- 
creases with larger widths, and that it also steadily decreases when we consider more 
associations. At a concordance width of ±100 and when looking at a minimum of 
four associations, the problem of data sparsity seems to be rather small. 

Next, let us look at column 5 (S1 > S2), which must be interpreted in conjunction
with the last column (S1 = S2). In all cases its values are fairly close to its
complement S1 < S2, which is not in the table but can be computed from the other two
columns. For example, for the concordance width of ±100 for the column S1 > S2
we get the readings 60, 67, 65, and 68 from Table 3, and can compute the
corresponding values of 58, 61, 66, and 64 for S1 < S2. Both sequences appear very
similar. Interpreted linguistically, this means that intra-sense association strengths tend
to be similar for the primary and the secondary sense, at least for our selection of
homographs.
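The complement can be checked with a few lines of Python, using the Table 3 readings for the concordance width of ±100 (column S1 > S2 and column S1 = S2) and the total of 134 homographs; the variable names are chosen for illustration.

```python
TOTAL_HOMOGRAPHS = 134

# Table 3, width ±100, for 2, 3, 4, and 5 associations per sense.
s1_greater_s2 = [60, 67, 65, 68]
s1_equal_s2 = [16, 6, 3, 2]

# Every homograph falls into exactly one of the cases S1 > S2, S1 = S2, S1 < S2.
s1_less_s2 = [
    TOTAL_HOMOGRAPHS - greater - equal
    for greater, equal in zip(s1_greater_s2, s1_equal_s2)
]
```

The resulting list is 58, 61, 66, and 64, which reproduces the values quoted in the text.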

Let us finally look at columns 3 and 4 of Table 3, which should give us an indication
of whether our co-occurrence based methodology has the potential to work if used
in a system for word sense induction or disambiguation. Both columns indicate that
we get improvements with larger context widths (up to 100) and when considering
more associations. At a context width of ±100 words and when considering all five
associations, the value for S1 > Ss reaches its optimum of 114. With two undecided
cases, this means that the count for S1 < Ss is 18, i.e. the ratio of correct to
incorrect cases is 6.33. This corresponds to an 85% accuracy, which appears to be a good
result. However, the corresponding ratio for S2 is only 2.77, which is considerably
worse and indicates that some of our previous discussion concerning the weaknesses
of secondary senses (cf. section 3), although not confirmed when comparing S1 to
S2, seems not unfounded. In future work, it would be of interest to explore whether there
is a relation between the relative occurrence frequency of a secondary sense and its
intra-sense association strength.
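
The arithmetic behind the ratio and the accuracy quoted above can be retraced with a few lines of Python. This is only a sketch: the function name is hypothetical, and the total number of cases is not stated explicitly but is assumed to be the sum of the three counts given in the text (114 correct, 2 undecided, 18 incorrect).

```python
def ratio_and_accuracy(correct: int, undecided: int, incorrect: int) -> tuple[float, float]:
    """Return (correct/incorrect ratio, share of correct cases among all cases)."""
    total = correct + undecided + incorrect  # assumed: every case falls into one of the three groups
    return correct / incorrect, correct / total


# Counts quoted in the text for a width of ±100 and five associations (primary sense).
ratio, accuracy = ratio_and_accuracy(correct=114, undecided=2, incorrect=18)
print(f"ratio = {ratio:.2f}, accuracy = {accuracy:.1%}")
```

Counting undecided cases as non-correct in the denominator is what yields the figure of roughly 85%.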

What we like best about the results is the gain in accuracy when the number of
associations considered is increased. At the concordance width of ±100 words we
get 77 correct predictions (S1 > Ss) when we take two associations into account, 97
with three, 108 with four, and 114 with five. The corresponding sequence of ratios
(S1 > Ss / S1 < Ss) looks even better: 1.88, 3.13, 4.70, and 6.33. This means that
with an increasing number of associations the quadratic increase of possible word pairs
leads to considerable improvements.

Table 3. Overall quantitative results for several concordance widths and various numbers of
associations considered.

Width   Assoc.   S1 > Ss   S2 > Ss   S1 > S2   S1 = Ss   S2 = Ss   S1 = S2
±1           2         1         0         1       131       132       133
             3         2         1         2       126       127       131
             4         3         3         3       123       123       128
             5         4         4         4       120       120       126
±3           2        15        10        13       105       108       107
             3        23        22        22        94        97        93
             4        36        30        29        74        79        79
             5        41        42        33        67        65        64
±10          2        34        32        31        78        70        75
             3        59        53        54        45        51        42
             4        85        64        67        24        35        19
             5        90        67        68        18        26        14
±30          2        59        54        47        43        39        45
             3        84        66        65        22        25        21
             4        95        86        65        16        14        11
             5       101        86        70        10        10         6
±100         2        77        74        60        18        22        16
             3        97        86        67         9        11         6
             4       108        94        65         4         4         3
             5       114        97        68         2         4         2
±300         2        85        77        68         8        11        10
             3        96        85        66         2         5         3
             4       105        94        63         1         1         2
             5       102       103        68         1         1         1
±1000        2        75        76        65         2         6         5
             3        82        89        59         1         3         1
             4        87        88        68         0         1         0
             5        90        92        68         0         0         1

5 Conclusions and future work 

Our experiments showed that associations belonging to the same sense of a homograph
have significantly higher co-occurrence counts than associations belonging to
different senses. However, the big challenge is the omnipresent problem of data sparsity,
which in many cases will not allow us to reliably observe this in a corpus. Our
results suggest two strategies to minimize this problem: One is to look for the optimal
window size, which in our setting was somewhat larger than the average sentence
length but is likely to depend on corpus size. The other is to increase the number
of associations considered, and to look at the co-occurrences of all possible pairs
of associations. Since the number of possible pairs increases quadratically with the
number of words that are considered, this should have a strong positive effect on
the sparse-data problem, which could be confirmed empirically. Both strategies for
dealing with the sparse-data problem can be applied in combination, seemingly without
undesired interaction.
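
The quadratic growth mentioned above follows from the number of unordered pairs among n associations, n(n-1)/2. The following sketch merely enumerates these pairs; the function name and the example words are hypothetical and not taken from the experiments.

```python
from itertools import combinations
from math import comb


def association_pairs(associations: list[str]) -> list[tuple[str, str]]:
    """Enumerate all unordered pairs of associations of one sense."""
    return list(combinations(associations, 2))


# Number of pairs for 2 to 5 associations: n * (n - 1) / 2.
pair_counts = [comb(n, 2) for n in range(2, 6)]
print(pair_counts)

# Hypothetical associations for one sense of a homograph.
pairs = association_pairs(["money", "account", "loan", "cash", "credit"])
print(len(pairs))
```

Each additional association thus adds more pairs than the previous one did, which is why co-occurrence evidence accumulates faster than the number of words considered.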

With the best settings of the two parameters, we obtained an accuracy of about
85%. This indicates that the statistical clues considered have the potential to work.
In addition, we see numerous possibilities for further improvement: These include
increasing the number of associations looked at, using a larger corpus, optimizing
window size in a more fine-grained manner than presented in Table 3, trying out
other association measures such as the log-likelihood ratio, and using automatically
generated associations instead of those produced by human subjects. Automatically
generated associations have the advantage that they are based on the corpus used, so
with regard to the sparse-data problem a better behavior can be expected.

Having shown how, in our particular framework, looking at groups of related
words rather than at single words can significantly reduce the problem of
data sparseness due to the quadratic increase in the number of possible relations, let
us mention some more speculative implications of such methodologies: Our guess is
that an analogous procedure should also be possible for other core problems in
statistical language processing that are affected by data sparsity. On the theoretical side,
the elementary mechanism of quadratic expansion would also be an explanation for
the often unrivalled performance of humans, and it may eventually be the key to the
solution of the poverty-of-the-stimulus problem.


Abstract. Recommender Systems are gaining widespread acceptance in e-commerce
applications to confront the information overload problem. Collaborative Filtering (CF) is a
successful recommendation technique, which is based on past ratings of users with similar
preferences. In contrast, Content-based Filtering (CB) exploits information solely derived from
document or item features (e.g. terms or attributes). CF has been combined with CB to improve
the accuracy of recommendations. A major drawback in most of these hybrid approaches was
that these two techniques were executed independently. In this paper, we construct a feature
profile of a user based on both collaborative and content features. We apply Latent Semantic
Indexing (LSI) to reveal the dominant features of a user. We provide recommendations
according to this dimensionally-reduced feature profile. We perform an experimental comparison
of the proposed method against well-known CF, CB and hybrid algorithms. Our results show
significant improvements in terms of providing accurate recommendations.



1 Introduction 

Collaborative Filtering (CF) is a successful recommendation technique. It provides
recommendations based on past ratings of users with similar preferences. However,
this technique has certain shortcomings. For instance, if a new item appears
in the database, it cannot be recommended before it has been rated.

In contrast, Content-Based filtering (CB) exploits only information derived from
document or item features (e.g., terms or attributes). Latent Semantic Indexing (LSI)
has been extensively used in the CB field to detect the latent semantic relationships
between terms and documents. LSI constructs a low-rank approximation to the
term-document matrix. As a result, it produces a less noisy matrix which is better
than the original one. Thus, higher level concepts are generated from plain terms.

Recently, CB and CF have been combined to improve the recommendation procedure.
Most of these hybrid systems are process-oriented: they run CF on the results
of CB or vice versa. CF exploits information from the users and their ratings. CB
exploits information from items and their features. However, despite being hybrid systems,
they miss the interaction between user ratings and item features.

In this paper, we construct a feature profile of a user to reveal the duality between
users and features. For instance, in a movie recommender system, a user prefers a
movie for various reasons, such as the actors, the director or the genre of the movie.
All these features affect the choice of each user differently. Then, we apply the Latent
Semantic Indexing (LSI) model to reveal the dominant features of a user. Finally, we
provide recommendations according to this dimensionally-reduced feature profile.
Our experiments with a real-life data set show the superiority of our approach over
existing CF, CB and hybrid approaches.

The rest of this paper is organized as follows: Section 2 summarizes the related 
work. The proposed approach is described in Section 3. Experimental results are 
given in Section 4. Finally, Section 5 concludes this paper. 

2 Related work 

In 1994, the GroupLens system implemented a CF algorithm based on common user
preferences. Nowadays, this algorithm is known as user-based CF. In 2001, another
CF algorithm was proposed. It is based on the similarities between items for neighborhood
generation. This algorithm is denoted as item-based CF.

The Content-Based filtering approach has been studied extensively in the Information
Retrieval (IR) community. Recently, Schult and Spiliopoulou (2006) proposed
the Theme-Monitor algorithm for finding emerging and persistent “themes”
in document collections. Moreover, in the IR area, Furnas et al. (1988) proposed LSI to
detect the latent semantic relationship between terms and documents. Sarwar et al.
(2000) applied dimensionality reduction to the user-based CF approach.

There have been several attempts to combine CB with CF. The Fab System
(Balabanovic et al. 1997) measures similarity between users after first computing a
content profile for each user. This process reverses that of the CinemaScreen System (Salter et
al. 2006), which runs CB on the results of CF. Melville et al. (2002) used a content-based
predictor to enhance existing user data, and then to provide personalized suggestions
through collaborative filtering. Finally, Tso and Schmidt-Thieme (2005) proposed
three attribute-aware CF methods applying CB and CF paradigms in two separate
processes before combining them at the point of prediction.

All the aforementioned approaches are hybrid: they either run CF on the results
of CB or vice versa. Our model discloses the duality between user ratings and item
features, to reveal the actual reasons for the users' rating behavior. Moreover, we apply
LSI on the feature profile of users to reveal the principal features. Then, we use a
similarity measure which is based on features, revealing the real preferences behind the
user’s rating behavior.



3 The proposed approach 

Our approach constructs a feature profile of a user, based on both collaborative and
content features. Then, we apply LSI to reveal the dominant feature trends. Finally,
we provide recommendations according to this dimensionally-reduced feature profile
of the users.

3.1 Defining rating, item and feature profiles

CF algorithms process the rating data of the users to provide accurate recommendations.
An example of rating data is given in Figures 1a and 1b. As shown, the example
data set (matrix R) is divided into a training and a test set, where I_1 to I_12 are items and
U_1 to U_4 are users. The null cells (no rating) are shown with a dash, and the rating scale
is [1-5], where 1 means strong dislike and 5 means strong like.

Definition 1 The rating profile R(U_k) of user U_k is the k-th row of matrix R.

For instance, R(U_1) is the rating profile of user U_1 and consists of the items that
U_1 has rated. The rating of a user u for an item i is given by the element
R(u,i) of matrix R.

Fig. 1. (a) Training Set (n × m) of Matrix R, (b) Test Set of Matrix R, (c) Item-Feature Matrix F



As described, content data are provided in the form of features. In our running
example, illustrated in Figure 1c, for each item we have four features that describe
its characteristics. We use matrix F, where element F(i,f) is one if item i contains
feature f, and zero otherwise.

Definition 2 The item profile F(I_k) of item I_k is the k-th row of matrix F.

For instance, F(I_1) is the profile of item I_1 and consists of features f_1 and f_2.
Notice that this matrix is not always boolean. Thus, if we process documents, matrix
F would count frequencies of terms.

To capture the interaction between users and their favorite features, we construct 
a feature profile composed of the rating profile and the item profile. 

For the construction of the feature profile of a user, we use a positive rating
threshold, P_τ, to select those items from his rating profile whose rating is not less than this
value. The reason is that the rating profile of a user consists of ratings that take values
from a scale (in our running example, a 1-5 scale). It is evident that ratings should be
“positive”, as the user does not favor an item that is rated with 1 on a 1-5 scale.

Definition 3 The feature profile P(U_k) of user U_k is the k-th row of matrix P, whose
elements P(u,f) are given by Equation 1.

$$P(u,f) = \sum_{\forall R(u,i) > P_\tau} F(i,f) \qquad (1)$$

In Figure 2, element P(U_k,f) denotes an association measure between user U_k
and feature f. In our running example (with P_τ = 2), P(U_2) is the feature profile of
user U_2 and consists of features f_1, f_2 and f_3. The correlation of a user U_k with
a feature f is given by the element P(U_k,f) of matrix P. As shown, feature f_2
describes him better than feature f_1 does.
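
The construction of matrix P from R and F can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and the toy matrices are hypothetical, unrated cells are encoded as 0, and the strict comparison of Equation 1 is used (the prose says "not less than", which would correspond to `>=`).

```python
import numpy as np


def feature_profile(R: np.ndarray, F: np.ndarray, p_tau: float) -> np.ndarray:
    """P[u, f] = sum of F[i, f] over all items i with R[u, i] > p_tau."""
    positive = (R > p_tau).astype(int)  # users x items, 1 where the rating is positive
    return positive @ F                 # users x features


# Hypothetical toy data: 2 users x 4 items, 0 = no rating (assumed encoding).
R = np.array([[4, 0, 5, 1],
              [0, 3, 0, 5]])
# Hypothetical item-feature matrix: 4 items x 3 features.
F = np.array([[1, 1, 0],
              [0, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])

P = feature_profile(R, F, p_tau=2)
print(P)
```

Each row of P counts how often a feature occurs among the items a user rated positively, which is the association measure between user and feature described above.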

Fig. 2. User-Feature matrix P divided into (a) Training Set (n × m), (b) Test Set



3.2 Applying SVD on training data 



Initially, we apply Singular Value Decomposition (SVD) on the training data of matrix
P, which produces three matrices based on Equation 2, as shown in Figure 3:

$$P = U \cdot S \cdot V' \qquad (2)$$

Fig. 3. Example of: P_{n×m} (initial matrix P), U_{n×m} (left singular vectors of P), S_{n×m} (singular
values of P), V'_{m×m} (right singular vectors of P).



3.3 Preserving the principal components

It is possible to reduce the n×m matrix S so that it retains only the c largest singular values.
The reconstructed matrix is then the closest rank-c approximation of the initial matrix P, as
shown in Equation 3 and Figure 4:

$$P^{*}_{n \times m} = U_{n \times c} \cdot S_{c \times c} \cdot V'_{c \times m} \qquad (3)$$

Fig. 4. Example of: P*_{n×m} (approximation matrix of P), U_{n×c} (left singular vectors of P*), S_{c×c}
(singular values of P*), V'_{c×m} (right singular vectors of P*).



We tune the number, c, of principal components (i.e., dimensions) with the ob- 
jective to reveal the major feature trends. The tuning of c is determined by the infor- 
mation percentage that is preserved compared to the original matrix. 
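
A rank-c truncation driven by a preserved-information percentage might look like the sketch below. The text does not define how the information percentage is measured; here it is assumed to be the share of the sum of singular values that is kept (a squared-singular-value criterion would be an equally plausible reading). The function name and the example matrix are hypothetical.

```python
import numpy as np


def truncate_svd(P: np.ndarray, info: float = 0.7):
    """Keep the smallest number c of singular values whose share of the total reaches `info`."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    share = np.cumsum(s) / s.sum()                    # cumulative share of singular values
    c = min(int(np.searchsorted(share, info)) + 1, len(s))
    return U[:, :c], np.diag(s[:c]), Vt[:c, :]


# Hypothetical user-feature matrix with singular values 3, 2 and 1.
P = np.array([[3.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])

U_c, S_c, Vt_c = truncate_svd(P, info=0.7)
P_star = U_c @ S_c @ Vt_c   # closest rank-c approximation of P
print(S_c.shape[0])
```

With this toy matrix, two singular values carry 5/6 of the total, so the 70% target is reached with c = 2 and the weakest dimension is dropped.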



3.4 Inserting a test user in the c-dimensional space 



Given the current feature profile of the test user u as illustrated in Figure 2b, we enter
a pseudo-user vector in the c-dimensional space using Equation 4. In our example, we
insert U_4 into the 2-dimensional space, as shown in Figure 5:

$$u_{new} = u \cdot V_{m \times c} \cdot S^{-1}_{c \times c} \qquad (4)$$

Fig. 5. Example of: u_new (inserted new user vector), u (user vector), V_{m×c} (two left singular
vectors of V), S^{-1}_{c×c} (two singular values of inverse S).



In Equation 4, u_new denotes the mapped ratings of the test user u, whereas V_{m×c}
and S^{-1}_{c×c} are matrices derived from SVD. This u_new vector should be appended to the
end of the U_{n×c} matrix shown in Figure 4.
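
The folding-in step can be sketched as below, assuming the mapping has the usual LSI form u_new = u · V · S^{-1} restricted to the c retained dimensions. The function name, the training matrix and the test vector are hypothetical. As a sanity check, folding in a training row reproduces that user's row of U_{n×c}.

```python
import numpy as np


def fold_in(u: np.ndarray, Vt_c: np.ndarray, S_c: np.ndarray) -> np.ndarray:
    """Map a feature profile u (length m) into the c-dimensional space."""
    return u @ Vt_c.T @ np.linalg.inv(S_c)


# Hypothetical training user-feature matrix (3 users x 3 features).
P_train = np.array([[2.0, 1.0, 1.0],
                    [0.0, 2.0, 1.0],
                    [1.0, 0.0, 2.0]])
U, s, Vt = np.linalg.svd(P_train, full_matrices=False)
c = 2
U_c, S_c, Vt_c = U[:, :c], np.diag(s[:c]), Vt[:c, :]

u = np.array([1.0, 2.0, 0.0])           # hypothetical feature profile of a test user
u_new = fold_in(u, Vt_c, S_c)
U_extended = np.vstack([U_c, u_new])    # append the pseudo-user as a new last row
print(U_extended.shape)
```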

3.5 Generating the Neighborhood of users/items 

In our model, we find the k nearest neighbors of the pseudo-user vector in the c-dimensional
space. The similarities between training and test users can be based on cosine similarity.
First, we compute the matrix U_{n×c} · S_{c×c}, and then we perform vector similarity.
This n × c matrix is the c-dimensional representation of the n users.
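
A neighborhood search by cosine similarity can be sketched as follows. The coordinates are hypothetical; in the setting described above, the training coordinates would be the rows of U_{n×c} · S_{c×c}, and it is assumed here that the pseudo-user vector is scaled in the same way before comparison.

```python
import numpy as np


def k_nearest_users(train_coords: np.ndarray, test_coords: np.ndarray, k: int):
    """Return the indices and cosine similarities of the k training users closest to the test user."""
    norms = np.linalg.norm(train_coords, axis=1) * np.linalg.norm(test_coords)
    sims = (train_coords @ test_coords) / np.where(norms == 0, 1.0, norms)
    order = np.argsort(-sims, kind="stable")[:k]   # highest similarity first
    return order, sims[order]


# Hypothetical 2-dimensional coordinates of four training users and one test user.
train_coords = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [1.0, 1.0],
                         [-1.0, 0.0]])
test_coords = np.array([2.0, 0.1])

neighbors, similarities = k_nearest_users(train_coords, test_coords, k=2)
print(neighbors)
```

Cosine similarity ignores vector length, so users with the same mix of dominant features count as close even if one of them has rated far more items.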

3.6 Generating the top-N recommendation list

The most often used technique for the generation of the top-N list is the one that
counts the frequency of each positively rated item inside the found neighborhood,
and recommends the N most frequent ones. Our approach differs from this
technique by exploiting the item features. In particular, for each feature f inside the
found neighborhood, we add its frequency. Then, based on the features that an item
consists of, we count its weight in the neighborhood. Our method takes into account
the fact that each user has his own reasons for rating an item.
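
One possible reading of this feature-weighted ranking is sketched below: feature frequencies are summed over the neighbors' feature profiles, and each item is weighted by the frequencies of the features it contains. The exact weighting is not spelled out in the text, so this is an assumption; the function name, the toy data, and the exclusion of already rated items are likewise hypothetical.

```python
import numpy as np


def top_n_items(P_neighbors: np.ndarray, F: np.ndarray, n: int, exclude=()) -> list[int]:
    """Rank items by the summed neighborhood frequency of their features."""
    feature_freq = P_neighbors.sum(axis=0)   # frequency of each feature in the neighborhood
    item_weight = F @ feature_freq           # weight of each item through its features
    skip = set(exclude)
    ranked = [int(i) for i in np.argsort(-item_weight, kind="stable") if int(i) not in skip]
    return ranked[:n]


# Hypothetical feature profiles of two neighbors (rows of P) and item-feature matrix F.
P_neighbors = np.array([[2, 1, 1],
                        [0, 2, 1]])
F = np.array([[1, 1, 0],
              [0, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])

# Item 0 is assumed to be already rated by the test user and is therefore skipped.
recommendations = top_n_items(P_neighbors, F, n=2, exclude=[0])
print(recommendations)
```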



4 Performance study 

In this section, we study the performance of our Feature-Weighted User Model
(FRUM) against the well-known CF, CB and a hybrid algorithm. For the experiments,
the collaborative filtering algorithm is denoted as CF and the content-based
algorithm as CB. As a representative of the hybrid algorithms, we used the CinemaScreen
Recommender Agent (Salter et al. 2006), denoted as CFCB. Factors
that are treated as parameters are the following: the neighborhood size (k, default
value 10), the size of the recommendation list (N, default value 20) and the size of
the training set (default value 75%). The P_τ threshold is set to 3. Moreover, we consider the
division between training and test data. Thus, for each transaction of a test user we
keep 75% as hidden data (the data we want to predict) and use the remaining 25%
as not hidden data (the data for modeling new users).

The extraction of the content features has been done through the well-known Internet
Movie Database (imdb). We downloaded the plain imdb database (ftp.fu-berlin.de - October 2006)
and selected 4 different classes of features (genres, actors, directors, keywords). Then, we
joined the imdb and the MovieLens data sets. The joining process led to 23 different genres,
9847 keywords, 1050 directors and 2640 different actors and actresses (we selected
only the 3 best paid actors or actresses for each movie).

Our evaluation metrics are from the information retrieval field. For a test user that
receives a top-N recommendation list, let R denote the number of relevant recommended
items (the items of the top-N list that are rated higher than P_τ by the test user). We
define the following: Precision is the ratio of R to N. Recall is the ratio of R to the total
number of relevant items for the test user (all items rated higher than P_τ by him). In the
following, we also use F1 = 2 · recall · precision / (recall + precision). F1 is used because
it combines both precision and recall.
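
The three metrics can be computed as in the following sketch. The function name and the example item ids are hypothetical; `relevant` stands for the hidden items of the test user that are rated above the threshold.

```python
def precision_recall_f1(recommended: list[int], relevant: set[int]) -> tuple[float, float, float]:
    """Precision = R / N, recall = R / |relevant|, F1 = harmonic mean of the two."""
    n = len(recommended)
    r = len(set(recommended) & relevant)   # relevant items among the recommended ones
    precision = r / n if n else 0.0
    recall = r / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * recall * precision / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical top-4 list and hypothetical set of relevant items of one test user.
precision, recall, f1 = precision_recall_f1([3, 2, 7, 9], {2, 3, 5, 8, 10})
print(precision, recall, round(f1, 3))
```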

4.1 Comparative results for CF, CB, CFCB and FRUM algorithms 

For the CF algorithms, we compare the two main cases, denoted as user-based (UB)
and item-based (IB) algorithms. The former constructs a user-user similarity matrix,
while the latter builds an item-item similarity matrix. Both of them exploit the user
ratings information (user-item matrix R). Figure 6a demonstrates that IB compares
favorably against UB for small values of k. For large values of k, both algorithms
converge, but never exceed the limit of 40% in terms of precision. The reason is that
as the k values increase, both algorithms tend to recommend the most popular items.
In the sequel, we will use the IB algorithm as a representative of CF algorithms.




Fig. 6. Precision vs. k of: (a) UB and IB algorithms, (b) 4 different feature classes, (c) 3
different information percentages of our FRUM model



For the CB algorithms, we have extracted 4 different classes of features from the 
imdb database. We test them using the pure content-based CB algorithm to reveal 
the most effective in terms of accuracy. We create an item-item similarity matrix 
based on cosine similarity applied solely on features of items (item-feature matrix 
F). In Figure 6b, we see results in terms of precision for the four different classes of 
extracted features. As it is shown, the best performance is attained for the “keyword” 
class of content features, which will be the default feature class in the sequel. 

Regarding the performance of our FRUM, we preserve, each time, a different
fraction of principal components of our model. More specifically, we preserve 70%,
30% and 10% of the total information of the initial user-feature matrix P. The results for
precision vs. k are displayed in Figure 6c. As shown, the best performance is attained
with 70% of the information preserved. This percentage will be the default value for
FRUM in the sequel.

In the following, we test the FRUM algorithm against the CF, CB and CFCB algorithms
in terms of precision and recall based on their best options. In Figure 7a, we plot a
precision versus recall curve for all four algorithms. As shown, the precision of all
algorithms falls as N increases. In contrast, as N increases, recall for all four algorithms
increases too. FRUM attains almost 70% precision and 30% recall when we recommend
a top-20 list of items. In contrast, CFCB attains 42% precision and 20% recall.
FRUM is more robust in finding relevant items for a user. The reason is two-fold: (i)
the sparsity has been downsized through the features, and (ii) the LSI application
reveals the dominant feature trends.

Now we test the impact of the size of the training set. The results for the F1 metric
are given in Figure 7b. As expected, when the training set is small, performance
downgrades for all algorithms. The FRUM algorithm is better than CF, CB and CFCB
in all cases. Moreover, low training set sizes do not have a negative impact on measure
F1 of the FRUM algorithm.

Fig. 7. Comparison of CF, CB, CFCB with FRUM in terms of (a) precision vs. recall, (b)
training set size.



5 Conclusions 

We propose a feature-reduced user model for recommender systems. Our approach
builds a feature profile for the users that reveals the real reasons for their rating
behavior. Based on LSI, we include the pseudo-feature user concept in order to reveal
a user's real preferences. Our approach significantly outperforms existing CF, CB and
hybrid algorithms. In our future work, we will consider the incremental update of our
model.


Abstract. This work reveals the reason for the bias in the separation levels computed for
natural languages with only a small amount of residues, as opposed to stochastically normally
distributed test cases like those presented in Holm (2007a). It is shown how these biased
data can be correctly projected to true separation levels. The result is a partly new chain of
separation for the main Indo-European branches that fits well with the grammatical facts as well
as with their geographical distribution. In particular, it strongly demonstrates that the Anatolian
languages were not the first to part, and thereby refutes the Indo-Hittite hypothesis.



1 General situation 

Traditional historical linguists tend a priori to look upon quantitative methods with
suspicion, because they argue that the question can be decided only by those agreements
which are supposed to stem exclusively from the direct ancestor, the so-called
‘common innovations’, or synapomorphies in biological terminology (Hennig 1966).
However, this seemingly perfect concept has in over a hundred years of research
brought about anything but agreement on even a minimum of groupings (e.g. those
in Hamp 2005). There is no grouping which is not debated in one or more ways.
‘Lumpers’ and ‘splitters’ are at work, as with nearly all proposed language families.

Quantitative attempts (cf. Holm (2005), to be updated (2007b), for an overview)
have not proved to be superior: First, all regard such trivial results as distinguishing
e.g. Greek from Germanic as a proof; secondly, many of them are fixated on a
mechanistic rate (or “clock”) assumption for linguistic changes (which is not our focus
here); and worst, mathematicians, biologists, and even some linguists retreat to
the too loose view that the amount of agreements is a direct measure of relatedness.
Elsewhere I have demonstrated that this assumption is erroneous because these
researchers overlook the dependence of this surface phenomenon on at least three
stochastic parameters, the “proportionality trap” (cf. Holm (2003); Swofford et al.
(1996:487)).




630 Hans J. Holm 



2 Special situation 

2.1 Recapitulation: What is the proportionality trap? 

Definition: If a “mother” language splits into (two) daughter languages, these are
“genealogically related”. At the point of split, or era of separation, both start with the
full amount of inherited features. However, they will soon begin to differentiate by
loss or replacement of these original features. These individual replacements occur

• independently (!) of each other,

• through new, irregular socio-psychological impacts in history. They are therefore
non-deterministic in that the next state of the environment is partially but not fully
determined by the previous state. Lastly, they are

• irreversible, because, when a feature is changed, it will normally never reappear.

Fig. 1. Different agreements, same relationship (!)



Because of these properties we have to regard linguistic change mathematically
as a stochastic process of draws without replacement. Any statistician will immediately
recognize this as the hypergeometric distribution. In word lists, we in fact have
all four parameters needed (cf. Fig. 1), as follows:

• The amounts of inherited features k_i and k_j (or residues, cognates, symplesiomorphies)
regarded as preserved from the common ancestor of any two languages L_i
and L_j;

• the amount of shared agreements ‘a_ij’ between them;

• the amount N of their common features at the time of separation (the universe),
not visible in the surface structure of the data. It is exactly this invisible universe N
that we are seeking, because it represents the underlying structure, the amount
of features which must have been present in both languages at the era of their
separation in the past. And again any mathematician has the solution: This “separation
level” N for each pair of branches can be estimated from the first moment (the mean)
of the hypergeometric distribution, the maximum likelihood estimator, rearranged to

$$N = k_i \times k_j / a_{ij}$$

Since changes can only lower the number of common features, a higher separation
level must lie earlier in time, and thus we obtain a ranked chain of separation in a
family of languages.
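
The estimator can be tried out with a few lines of Python. The counts below are invented for illustration only and do not come from any of the databases discussed; the function name is likewise hypothetical. The example shows two language pairs with different numbers of agreements but the same estimated separation level.

```python
def separation_level(k_i: int, k_j: int, a_ij: int) -> float:
    """Estimate the universe N at the era of separation as k_i * k_j / a_ij."""
    if a_ij <= 0:
        raise ValueError("at least one shared agreement is needed for an estimate")
    return k_i * k_j / a_ij


# Hypothetical pair A: many residues, many agreements.
n_a = separation_level(k_i=400, k_j=300, a_ij=150)
# Hypothetical pair B: fewer residues, far fewer agreements, same separation level.
n_b = separation_level(k_i=200, k_j=160, a_ij=40)
print(n_a, n_b)
```

Judging by raw agreement counts alone (150 versus 40), pair A would look much more closely related, although both pairs yield the same N.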

2.2 Applications up to now 

The first to propose and apply this method was the British mathematician D.G.
Kendall (1950), with the Indo-European data of Walde/Pokorny (1926-32). It has
since been applied independently and extensively to the data of the improved dictionary
of Pokorny (1959) by Holm (2000, passim). The results seemed convincing,
in particular for the North-Western group, and also for the relation of Greek and the
Indo-Iranian group. The late separations of Albanian, Armenian, and Hittite could
well have been explained by their central position and therefore did not appear suspicious.

Only when a further application, to Mixe-Zoquean data by Cysouw et al. (2006),
produced the similar observation that only languages with few known residues
appeared to separate late could a systematic bias be suspected. Cysouw et al. discarded
the SLR-method because their results partly contradicted the subgrouping of
Mixe-Zoquean as inferred by the traditional methods of two historical linguists (which in fact
did not completely agree with each other). In a presentation, Cysouw (2004) suspected
that the “unbalanced amount of available data distorts the estimates”, and
“Error 1: they are grouped together, because of many shared retentions.” However,
this only demonstrates that the basics explained above were not correctly understood,
since the hypergeometric simply does not rest on one parameter alone.

In this study we will use the most modern and acknowledged Indo-European
database, the “Lexikon der indogermanischen Verben” (Rix et al. (2002), henceforth
LIV-2). I am very obliged to the authors for sending me the digitalized version, which
alone enabled me to quantify the contents in acceptable time. The reasons for
this tremendous undertaking were:

• The commonplace in linguistics (though seldom mentioned) that verbs are borrowed
much less often than nouns, which has not been taken into account by any quantitative
work up to now.

• The more trustworthy combined work of a team at an established department of
Indo-European under the supervision of a professional historical linguist should
guarantee a very high standard, all the more so in this second edition.

• Compared with Pokorny, which is outdated in many parts, we now have much better
knowledge of the Anatolian and Tocharian languages.

Fig. 2. Unwanted dependence of N on k in the LIV-2 list



3 The bias 

3.1 Unwanted dependence 

Nevertheless, these much better data, not suspicious of poor knowledge, displayed
the same bias as the other ones, as demonstrated in Fig. 2, which presents the correlation
between the residues ‘k’ and the corresponding N’s in falling order. Thus we
have a problem. The reason for this bias, contrary to Cysouw et al., could not lie in
a poor knowledge of the data, nor could it lie in the algorithm, as I have additionally
tested in hundreds of random cases, some of which are published in Holm (2007a).
Consequently, the reason had to be found in the word lists alone, the properties of which
we will now have to inspect more closely:

3.2 Revisiting the properties of word lists 

The effects of scatter on the subgrouping problem as well as its handling have been
intensively investigated by Holm (2007a). This study as well as textbooks suggest
that the sum of residues ‘k’ should exceed at least 90 % of the universe ‘N’, and
a single ‘k’ must not fall below 20 %. However, since the LIV-2 database is big
enough to guarantee a low scatter, there must be something else, overlooked up to
now. A first hint has already been given by D.G. Kendall (1950:41), who noticed that
“One must, however, assume that along a given segment of a given line of descent
the chance of survival is the same for every root exposed to risk, and one must also
assume that the several roots are exposed to risk independently”. The latter condition
is the easier part, since linguists would agree that changes in the lexicon or grammar
occur independently of each other. (The so-called push-and-pull chains are mainly a
phonetic symptom and of lesser interest here.) The real problem is the first condition,
since the chance of survival is not at all the same for every feature, and every word has
its own history. For our purpose, we need not necessarily deal with the reasons for
these changes in detail. Could the reason for the observed bias perhaps be found in a
distribution that contradicts the conditions of the hypergeometric, and perhaps of other
quantitative approaches, too?

3.3 The distribution in word lists 

For that purpose, the 1195 reconstructed verbal roots of the LIV-2 are entered as
“characters” into the rows of a spreadsheet, while the columns contain the 12 branches of
Indo-European. We then sum up the cross totals into a new column, which now contains
the frequency for every row. After sorting by these frequencies, we get twelve different blocks
or slices, one for each frequency. By counting out every language per slice, we get
12 matrices, of which we enter the arithmetic means into the final table. Let us now
have a closer look at the plot of this in Fig. 3.
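
The slicing step can be sketched in Python instead of a spreadsheet. The presence matrix below is a tiny invented example (five roots, three branches) rather than the LIV-2 data, and the function name is hypothetical. The sketch only covers the counting of residues per branch within each frequency slice; the subsequent averaging over the twelve matrices is not reproduced, since its details are not given here.

```python
import numpy as np


def residues_per_slice(X: np.ndarray) -> np.ndarray:
    """Row f-1 holds, for each branch, the number of roots attested in exactly f branches."""
    freq = X.sum(axis=1)                 # cross total: in how many branches a root survives
    n_branches = X.shape[1]
    table = np.zeros((n_branches, n_branches), dtype=int)
    for f in range(1, n_branches + 1):
        table[f - 1] = X[freq == f].sum(axis=0)   # count every branch within slice f
    return table


# Hypothetical presence matrix: rows = roots, columns = branches, 1 = root attested.
X = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1],
              [1, 1, 1],
              [0, 0, 1]])

slices = residues_per_slice(X)
print(slices)
```

Each row of the result corresponds to one slice and gives the residues of every branch among roots with the same frequency, which is the quantity plotted per frequency in the figure.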




Fig. 3. All frequencies of the LIV-2 data



3.4 Detecting the reason

Immediately we observe on the right-hand side the few verbs which occur in many
languages, growing towards the left with the many verbs occurring in fewer languages,
and breaking down in the special case of verbs occurring in one language only. To find
out the reason for the false correlation between these curves and the bias with the
less well represented languages, we must first identify the connections with our
formula: N depends on the product of the residues ‘k’ of any language, represented
here by the area below its curve. This product is then divided by their agreements
‘a’, which are naturally represented by the frequencies (bottom line). And here, not
easy to detect, lies the key: The further to the right, the higher the agreements
per residue. Further: The smaller the sum of residues of a branch (area below its
curve), the higher is the proportion of agreements, resulting in a falsely lower separation
level. So far we have located the problem. But we are still far from a solution.



4 Solution and operationalization 

Since the bias will turn up every time one employs the total of the data, we must 
compute every slice separately, thereby using only data with the same chance of 
being replaced. Restrictions in printing space do not allow us to address the different 
options of implementation. In any case, it is extremely useful to pre-order the numerical 
outcome according to the presumptive next neighbor of every branch by the 
historical-bundling (“Bx”) method, explained in Holm (2000:84-5, or 2005:640). Finally, 
the matrix allows us to reconstruct the tree. To avoid the attractions, deletions, 
and other new bias resulting from traditional clustering methods, it is methodologically 
advisable to proceed on a broad front. That means first combining every branch 
with its next neighbor (if there is one), and only then proceeding in this way uphill, 
finding the next node by the arithmetic mean of the cross fields. This helps considerably to 
even out the unavoidable scatter. Additionally, the above-mentioned Bx values are 
particularly helpful to “flatten” the graph, which naturally is only ordered in the one 
direction of descent, but might represent different clusters or circles in real geography, 
which cannot be displayed in such a two-dimensional graph. 



5 Discussion 

Though we have ruled out the bias in the distributions of word lists, there could 
well be more bias hidden in the data. First, extremely different cultural backgrounds 
between compared branches would lead to fewer agreements and thus to a falsely early split. 
An example is the Baltic languages in a cultural environment of hunter-gatherer 
communities vs. the Anatolian languages in an environment of advanced 
civilizations. Secondly, there may remain differences in the reliability of the research 
itself. Third, it is well known that the relative position of the branches gives rise to 
more innovations in the center vs. more conservative behavior in peripheral positions 
(“Saumlage”). 

Fig. 4. New SLRD-based tree of the main IE branches 



6 Conclusions 

Linguistically, this study clearly refutes an early split of/from Anatolian and thereby 
the “Indo-Hittite” Hypothesis. Methodologically, the former “Separation-Level Recovery 
method” is updated to one accounting for the Distribution (SLRD). The insights 
gained should caution everybody against trusting methods that disregard the hypergeometric 
behavior of language change, as well as the distribution in word lists. 
People must not be dazzled by apparently good results, which regularly appear 
solely because of very strong signals in the data, or simply by chance. 

Abstract. It is shown that word length and other properties of linguistic units display a lawful 
behavior not only in the form of distributions but also with respect to their syntagmatic 
arrangement in a text. Based on L-segments (units of constant or increasing lengths), F-segments, and 
T-segments (units of constant or increasing frequency or polytextuality respectively), the dy- 
namic behavior of segment patterns is investigated. Theoretical models are derived on the basis 
of plausible assumptions on influences of the properties of individual units on the properties 
of their constituents in the text. The corresponding hypotheses are tested on data from 66 Ger- 
man texts of four authors and two different genres. Experiments with various characteristics 
show promising properties which could be useful for author and/or genre discrimination. 



1 Introduction 

Most quantitative studies in linguistics are almost exclusively based on a "bag of 
words" model, i.e. they disregard the syntagmatic dimension, the arrangement of the 
units in the course of the given text. Gustav Herdan underlined this fact as early as 
1966; he called the two different types of study "language in the mass vs. language in 
the line" (Herdan (1966), p. 423). Only very few investigations have been carried out 
so far with respect to sequences of properties of linguistic units (cf. Hřebíček (2000), 
Andersen (2005), Köhler (2006) and Uhlířová (2007)). A special approach is time 
series studies (cf. Pawlowski (2001)), but the application of such methods to studies 
of natural language is not easy to justify and is connected with methodological problems 
of assigning numerical values to categorical observations. The present paper 
approaches the problem of the dynamic behavior of sequences of properties without 
limitation to a specific grouping such as word pairs or N-grams. It starts from the 
general hypothesis that sequences in a text are organized in lawful patterns rather 
than chaotically or according to a uniform distribution. There are several possibilities 
to define units such as phrases or clauses which could be used to find patterns 
or regularities. They suffer, however, from several disadvantages: (1) They do not 
provide an appropriate granularity. While words are too small, sentences seem to be 
too large units to unveil syntagmatic patterns with quantitative methods. (2) Linguistic 
units are inherently connected to specific grammar models. (3) Their application 
leads again to a "bag of units" model and thus to a loss of syntagmatic information. 
Therefore, we establish a different unit for our present project: a unit based 
on the rhythm of the property under study itself. We will demonstrate the approach 
using word length as an illustrative example. We define an L-segment as a maximal 
sequence of monotonically non-decreasing numbers (i.e. of constant or increasing values), 
where these numbers represent 
the lengths of adjacent words of the text. Using this definition, we segment a given 
text in a left-to-right fashion starting with the first word. In this way, a text can be 
represented as an uninterrupted sequence of L-segments. Thus, the text fragment (1) 
is segmented as shown by the L-segment sequence (2) if word length is measured in 
terms of syllable number: 

(1) Word length studies are almost exclusively devoted to the 
problem of distributions. 

(2) (1-1-2) (1-2-4) (3) (1-1-2) (1-4) 
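
The segmentation of fragment (1) into sequence (2) can be reproduced with a short sketch. The function name `l_segments` is an illustrative choice, and the syllable counts are entered by hand rather than computed; a segment is closed as soon as the next value is smaller than the previous one.

```python
def l_segments(lengths):
    """Split a sequence of word lengths into maximal non-decreasing runs."""
    segments = []
    current = []
    for value in lengths:
        # A drop in length closes the current segment.
        if current and value < current[-1]:
            segments.append(tuple(current))
            current = []
        current.append(value)
    if current:
        segments.append(tuple(current))
    return segments


# Syllable counts of the twelve words of fragment (1), counted by hand:
# Word length studies are almost exclusively devoted to the problem of distributions.
fragment = [1, 1, 2, 1, 2, 4, 3, 1, 1, 2, 1, 4]
segments = l_segments(fragment)

# Iterative application: segment the sequence of L-segment lengths again.
second_level = l_segments([len(s) for s in segments])
```

Because every value is appended to exactly one run, the segmentation is exhaustive and unambiguous, and the same function can be applied to frequency or polytextuality values.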

This kind of segmentation is similar to Boroda’s F-motiv for musical "texts". Boroda 
(1982) defined his F-motiv in an analogous way but with respect to the duration of 
the notes of a musical piece. The advantage of such a definition is obvious: Any text 
can be segmented in an objective, unambiguous, and exhaustive way, i.e. it guarantees 
that each element will be assigned to exactly one unit. Furthermore, it provides units 
of an appropriate granularity and it can be applied iteratively, i.e. sequences of 
L-segments and their lengths can be studied etc.; thus the unit can be scaled over an 
in principle unlimited range of sizes and granularities, limited only by text length. 
Analogously, F- and T-segments are formed by monotonically non-decreasing sequences 
of frequency and polytextuality values. Units other than words can be used as basic 
units, such as morphs, syllables, phrases, clauses, and sentences, and other properties 
such as polysemy, synonymy, age etc. can be used for analogous definitions. Our 
study concentrated on L-, F-, and T-segments, and we will report here mainly on the 
findings using L-segments. While the focus of our interest lies on basic scientific 
questions such as "why do these units show their specific behavior?" we also had a 
look at the possible use of our findings for purposes such as text genre classification 
and authorship determination. 



2 Data 

For the present study we compiled a small corpus by selecting 66 documents from 
the Projekt Gutenberg-DE (http://gutenberg.spiegel.de/). The corpus consists of 30 
poems and 36 short stories, written by 4 different German authors between the late 
18th and the early 20th century: 

Text length varies between 90 and 8500 running word forms (RWF). As is to be 
expected, the poems tend to be considerably shorter than the narrative texts: The 
average length is 446 RWF for the poems and 3063 RWF for the short stories. 

Table 1. Text numbers in the corpus with respect to genre and author 

          Brentano  Goethe  Rilke  Schnitzler    Σ
poetry        10       10     10        -       30
prose          2        9     10       15       36



3 Distribution of segment types 



Starting from the hypothesis that L-, F- and T-segments are not only units which 
are easily defined and easy to determine, but also possess a certain psychological 
reality, i.e. that they play a role in the process of text generation, it seems plausible to 
assume that these units display a lawful distributional behaviour similar to that of 
well-known linguistic units such as words or syntactic constructions (cf. Köhler (1999)). 
A first confirmation, however on data from only a single Russian text, was found in 
Köhler (2007). A corresponding test on the data of the present study corroborates 
the hypothesis. Each of the 66 texts shows a rank-frequency distribution of the 3 
kinds of segment patterns according to the Zipf-Mandelbrot distribution, which was 
fitted to the data in the following form: 



\[ P_x = \frac{(b + x)^{-a}}{F(n)}, \qquad x = 1, 2, 3, \ldots, n, \quad a \in \mathbb{R}, \quad b > -1, \quad n \in \mathbb{N}, \qquad F(n) = \sum_{i=1}^{n} (b + i)^{-a} \tag{1} \]

Figure 1 shows the fit of this distribution to the data of one of the texts on the basis of 
L-segments on a log-log scale. In this case, the goodness-of-fit test yielded P(χ²) ≈ 
1.0 with 92 degrees of freedom. N = 941 L-segments were found in the text, forming 
x_max = 112 different patterns. Similar results were obtained for all three kinds of 
segments and all texts. Various experiments with the frequency distributions show 
promising differences between authors and genres. However, these differences alone 
do not yet allow for a crisp discrimination. 

Fig. 1. Rank-Frequency Distribution of L-Segments 



4 Length distribution of L-segments 



As a consequence of our general hypothesis, not only the segment types but also the 
length of the segments should follow lawful patterns. Here, we study the distribution 
of L-segment length. First, a theoretical model is set up on the basis of three plausible 
assumptions: 

1. There is a tendency in natural language to form compact expressions. This can be 
achieved at the cost of more complex constituents on the next level. An example 
is the following: The phrase "as a consequence" consists of 3 words, where the 
word "consequence" has 3 syllables. The same idea can be expressed using the 
shorter expression "consequently", which consists of only 1 word of 4 syllables. 
Hence, more compact expressions on one level go along with more complex 
expressions on the next level. Here, the consequence of the formation of longer 
words is relevant. The variable K will represent this tendency. 

2. There is an opposed tendency, viz. word length minimization. It is a consequence 
of the same tendency of effort minimization which is responsible for the first 
tendency but now considered on the word level. We will denote this requirement 
by M. 

3. The mean word length in a language can be considered as constant, at least for a 
certain period of time. This constant will be represented by q. 



According to a general approach proposed by Altmann (cf. Altmann and Köhler 
(1996)) and substituting k = K − 1 and m = M − 1, the following equation can be set 
up: 

\[ P_x = \frac{k + x - 1}{m + x - 1}\, q\, P_{x-1} \tag{2} \]

which yields the hyper-Pascal distribution (cf. Wimmer and Altmann (1999)): 

\[ P_x = \frac{\binom{k + x - 1}{x}}{\binom{m + x - 1}{x}}\, q^{x} P_0, \qquad x = 0, 1, 2, \ldots \tag{3} \]

with $P_0^{-1} = {}_2F_1(k, 1; m; q)$, the hypergeometric function, as norming constant. 
Here, (3) is used in a 1-displaced form because length 0 is not defined, i.e. 
L-segments consisting of 0 words are impossible. As this model is not likely to be 
adequate also for F- and T-segments (the requirements concerning the basic properties 
frequency and polytextuality do not imply interactions between adjacent levels), 
a simpler one can be set up. Due to length limitations to our contribution in this 
volume we will not describe the appropriate model for these segment types but it 
can be said here that their length distributions can be modeled and explained by the 
hyper-Poisson distribution. 
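
The step from recurrence (2) to the probabilities of (3) can be checked numerically. The sketch below is an illustration under assumptions: the function names and parameter values are hypothetical, 0 < q < 1 and m > 0 are assumed, and the norming constant is approximated by truncating the series at `x_max` instead of evaluating the hypergeometric function.

```python
def hyper_pascal_pmf(k, m, q, x_max):
    """Probabilities P_0..P_x_max built from recurrence (2).

    The series is truncated at x_max and normalised by its partial sum,
    which approximates the norming constant of (3) when 0 < q < 1.
    """
    if not 0 < q < 1:
        raise ValueError("q must lie strictly between 0 and 1")
    terms = [1.0]                     # unnormalised P_0
    for x in range(1, x_max + 1):
        terms.append(terms[-1] * (k + x - 1) / (m + x - 1) * q)
    total = sum(terms)
    return [t / total for t in terms]


def one_displaced(pmf):
    """Map P_x to segment length x + 1, since length 0 cannot occur."""
    return {x + 1: p for x, p in enumerate(pmf)}


# Hypothetical parameter values for illustration only.
pmf = hyper_pascal_pmf(k=3.0, m=2.0, q=0.4, x_max=200)
length_probs = one_displaced(pmf)
# Special case k = m: the ratio in (2) reduces to q, i.e. a geometric distribution.
geometric = hyper_pascal_pmf(k=2.0, m=2.0, q=0.5, x_max=60)
```

The special case k = m is a convenient sanity check, because the normalised P_0 must then approach 1 − q.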




Fig. 2. Theoretical and empirical distribution of L-segments in a poem 

Fig. 3. Theoretical and empirical distribution of L-segments in a short story 



The empirical tests on the data from the 66 texts support our hypothesis with good 
and very good χ² values. Figures 2 and 3 show typical graphs of the theoretical and 
empirical distributions as modeled using the hyper-Pascal distribution. Figure 2 is an 
example of poetry; Figure 3 shows a narrative text. Good indicators of text genre or 
authors could not yet be found on the basis of these distributions. However, only a 
few of the available characteristics have been considered so far. The same is true of 
the corresponding experiments with F- and T-segments. 



5 TTR studies 

Another hypothesis investigated in our study is the assumption that the dynamic behavior 
of the segments with respect to the increase of types in the course of the given 
text, the so-called TTR, is analogous to that of words or other linguistic units. Word 
TTR has the longest history; the large number of approaches presented in linguistics 
is described and discussed by Altmann (1988, p. 85-90), who also gives a theoretical 
derivation of the so-called Herdan model, the most commonly used one in 
linguistics: 

\[ y = x^{a} \tag{4} \]

where x represents the number of tokens, i.e. the individual position of a running 
word in a text, y the number of types, i.e. different words, and a is an empirical parameter. 
However, this model is appropriate only in case of very large inventories, 
such as the vocabulary of a language. For smaller inventories, other models must be 
derived (cf. Köhler, R. and Martináková-Rendeková, Z. (1998), Köhler, R. (2003a) 
and Köhler, R. (2003b)). We expect model (5) to work with segment TTR, an equation 
which was derived by Altmann (1980) for the Menzerath-Altmann Law and 
later in the framework of synergetic linguistics: 

\[ y = a x^{b} e^{cx}, \quad c < 0. \tag{5} \]

Because the first segment of a text must be the first type, y must equal unity at x = 1, 
which fixes a = e^{-c}. Therefore, we can remove this parameter from 
the model and simplify (5) as shown in (6): 

\[ y = e^{-c} x^{b} e^{cx} = x^{b} e^{c(x-1)}, \quad c < 0. \tag{6} \]
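
Model (6) becomes linear in b and c after taking logarithms, ln y = b ln x + c (x − 1), which suggests a simple way to estimate the two parameters. The sketch below uses ordinary least squares on this log-linear form; this is an assumption made for illustration, the source does not state which fitting procedure was used, and the function names and synthetic data are hypothetical.

```python
import math


def ttr_model(x, b, c):
    """Model (6): y = x^b * exp(c * (x - 1))."""
    return x ** b * math.exp(c * (x - 1))


def fit_ttr(xs, ys):
    """Least-squares estimates of b and c from ln y = b*ln x + c*(x - 1)."""
    s11 = s12 = s22 = t1 = t2 = 0.0
    for x, y in zip(xs, ys):
        u, v, z = math.log(x), x - 1.0, math.log(y)
        s11 += u * u
        s12 += u * v
        s22 += v * v
        t1 += u * z
        t2 += v * z
    det = s11 * s22 - s12 * s12      # determinant of the 2x2 normal equations
    b = (t1 * s22 - t2 * s12) / det
    c = (s11 * t2 - s12 * t1) / det
    return b, c


# Synthetic, noise-free data generated from the model itself (not corpus data).
xs = list(range(1, 201))
ys = [ttr_model(x, 0.9, -0.001) for x in xs]
b_hat, c_hat = fit_ttr(xs, ys)
```

Fitting in log space weights relative rather than absolute deviations, so the estimates can differ slightly from a direct nonlinear fit on real data.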

Figures 4 and 5 show the excellent fits of this model to data from one of the poems 
and one of the prose texts. Goodness-of-fit was determined using the determination 
coefficient R², which was above 0.99 in all 66 cases. The parameters b and c of the 
TTR model turned out to be quite promising characteristics of text genre and author. 
They are not likely to discriminate these factors sufficiently when taken alone but 
seem to carry a remarkable amount of information. Figure 6 shows the relationship 
between the parameters b and c. 

Fig. 4. L-segment TTR of a poem (plot title: TTR: y = exp(-c)*x^b*exp(c*x); horizontal axis: x) 

Fig. 5. L-segment TTR of a short story (plot title: TTR: y = exp(-c)*x^b*exp(c*x); horizontal axis: x) 

Fig. 6. Relationship between the values of b and c in the corpus 



6 Conclusion 

Our study has shown that L-, F- and T-segments on the word level display a lawful 
behavior in all aspects investigated so far and that some of the parameters, in particular 
those of the TTR, seem promising for text classification. Further investigations 
on more languages and on more text genres will give more reliable answers to these 
questions. 

Abstract. Dialectometry produces aggregate DISTANCE MATRICES in which a distance is 
specified for each pair of sites. By projecting groups obtained by clustering onto geography 
one compares results with traditional dialectology, which produced maps partitioned into im- 
plicitly non-overlapping DIALECT AREAS. The importance of dialect areas has been chal- 
lenged by proponents of CONTINUA, but they too need to compare their findings to older 
literature, expressed in terms of areas. 

Simple clustering is unstable, meaning that small differences in the input matrix can lead 
to large differences in results (Jain et al. 1999). This is illustrated with a 500-site data set from 
Bulgaria, where input matrices which correlate very highly (r = 0.97) still yield very different 
clusterings. Kleiweg et al. (2004) introduce COMPOSITE CLUSTERING, in which random noise 
is added to matrices during repeated clustering. The resulting borders are then projected onto 
the map. 

The present contribution compares Kleiweg et al.’s procedure to resampled bootstrapping, 
and also shows how the same procedure used to project borders from composite clustering 
may be used to project borders from bootstrapping. 



1 Introduction 

We focus on dialectal data, examined at a high level of aggregation, i.e. the average 
linguistic distance between all pairs of sites in large dialect surveys. It is important 
to seek groups in this data, both to examine the importance of groups as organizing 
elements in the dialect landscape and to compare current, computational 
work to traditional accounts. Clustering is thus important as a means of seeking 
groups in data, but it suffers from instability: small input differences can lead to 
large differences in results, i.e., in the groups identified. 

We investigate two techniques for overcoming the instability in clustering techniques: 
bootstrapping, well known from the biological literature, and “noisy” clustering, 
which we introduce here. In addition we examine a novel means of projecting 
the results of (either technique involving) such repeated clusterings to the geographic 
map, arguing that it is better suited to revealing the detailed structure in dialectological 
distance matrices. 



2 Background and motivation 

We assume the view of dialectometry (Goebl, 1984 inter alia) that we characterize 
dialects in a given area in terms of an aggregate distance matrix, i.e. an assignment 
of a linguistic distance d to each pair of sites $s_1, s_2$ in the area: $D(s_1, s_2) = d$. 
Linguistic distances may be derived from vocabulary differences, differences in structural 
properties such as syntax (Spruit, 2006), differences in pronunciation, or otherwise. 
We ignore the derivation of the distances here, except to note two aspects. 
First, we derive distances via individual linguistic items (in fact, words), so that we 
are able to examine the effect of sampling on these items. Second, we focus on 
true distances, satisfying the usual distance axioms, i.e. having a minimum at zero: 
$\forall s_1\; D(s_1, s_1) = 0$; symmetry: $\forall s_1, s_2\; D(s_1, s_2) = D(s_2, s_1)$; and the triangle 
inequality: $\forall s_1, s_2, s_3\; D(s_1, s_2) \le D(s_1, s_3) + D(s_3, s_2)$ (see Kruskal 1999:22). We return to the 
issue of whether the distances are ULTRAMETRIC in the sense of the phylogenetic 
literature below. 
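
The three distance axioms can be verified mechanically for any candidate matrix. The following sketch is illustrative only: the function name, the tolerance and the toy matrices are assumptions, and the cubic loop over triples is adequate for a few hundred sites but not optimized.

```python
from itertools import permutations


def is_distance_matrix(D, tol=1e-12):
    """Check zero diagonal, non-negativity, symmetry and triangle inequality."""
    n = len(D)
    for i in range(n):
        if abs(D[i][i]) > tol:                    # minimum at zero
            return False
        for j in range(n):
            if D[i][j] < -tol:                    # no negative distances
                return False
            if abs(D[i][j] - D[j][i]) > tol:      # symmetry
                return False
    for i, j, k in permutations(range(n), 3):
        if D[i][j] > D[i][k] + D[k][j] + tol:     # triangle inequality
            return False
    return True


# Hypothetical three-site examples.
good = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 1.5],
        [2.0, 1.5, 0.0]]
bad = [[0.0, 1.0, 5.0],     # 5 > 1 + 1 violates the triangle inequality
       [1.0, 0.0, 1.0],
       [5.0, 1.0, 0.0]]
good_ok = is_distance_matrix(good)
bad_ok = is_distance_matrix(bad)
```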

We focus here on how to analyze such distance matrices, and in particular how 
to detect areas of relative similarity. While multi-dimensional scaling has undoubt- 
edly proven its value in dialectometric studies (Embleton (1987), Nerbonne et al. 
(1999)), we still wish to detect DIALECT AREAS, both in order to examine how well 
areas function as organizing entities in dialectology, and also in order to compare 
dialectometric work to traditional dialectology in which dialect areas were seen as 
the dominant organizing principle. 

Clustering is a standard way in which to seek groups in such data, and it 
is applied frequently and intelligently to the results of dialectometric analyses. The 
research community is convinced that the linguistic varieties are hierarchically or- 
ganized; thus, e.g., the urban dialect of Freiburg is a sort of Low Alemannic, which 
is in turn Alemannic, which is in turn Southern German, etc. This means that the 
techniques of choice have been different varieties of hierarchical clustering (Schiltz 
(1996), Mucha and Haimerl (2005)). 

Hierarchical clustering is most easily understood procedurally: given a square 
distance matrix of size n × n, we seek the smallest distance in it. Assume that this 
is the distance between i and j. We then fuse the two elements i and j, obtaining an 
(n − 1) × (n − 1) square matrix. One needs to determine the distance from the newly added i + j 
element to all remaining k, and there are several alternatives for doing this, including 
nearest neighbor, average distance, weighted average distance, and minimal variance 
(Ward’s method). See Jain et al. (1999) for discussion. We return in the discussion 
section to the differences between the clustering algorithms, but in order to focus on 
the effects of bootstrapping and “noisy” clustering, we use only weighted average 
(WPGMA) in the experiments below. 
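
The procedure just described can be written down compactly. The sketch below is a plain, unoptimized illustration of weighted average (WPGMA) clustering, not the software used in the experiments; the function name, the site labels and the toy distances are hypothetical. After each fusion, the distance from the new cluster to any other cluster k is the simple mean of the two old distances, regardless of cluster sizes.

```python
def wpgma(dist, labels):
    """Agglomerative WPGMA clustering.

    dist: symmetric square matrix (list of lists); labels: one name per row.
    Returns the fused groups in order of fusion as
    (frozenset_of_labels, distance_at_fusion) pairs.
    """
    n = len(labels)
    clusters = {i: frozenset([labels[i]]) for i in range(n)}
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}

    def get(a, b):
        return d[(a, b) if a < b else (b, a)]

    merges = []
    next_id = n
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Seek the smallest distance among the currently active clusters.
        a, b = min(
            ((x, y) for pos, x in enumerate(ids) for y in ids[pos + 1:]),
            key=lambda pair: d[pair],
        )
        height = d[(a, b)]
        # WPGMA update: unweighted mean of the two old distances.
        for k in ids:
            if k != a and k != b:
                d[(k, next_id)] = (get(a, k) + get(b, k)) / 2.0
        members = clusters.pop(a) | clusters.pop(b)
        clusters[next_id] = members
        merges.append((members, height))
        next_id += 1
    return merges


# Hypothetical four-site example.
sites = ["A", "B", "C", "D"]
toy = [[0, 1, 4, 6],
       [1, 0, 4, 6],
       [4, 4, 0, 2],
       [6, 6, 2, 0]]
merges = wpgma(toy, sites)
```

In the toy example A and B fuse first at distance 1, then C and D at 2, and the two pairs finally fuse at (4 + 6) / 2 = 5.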

The result of clustering is a DENDROGRAM, a tree in which the history of the 
clustering may be seen. For any two leaf nodes in the dendrogram we may determine 
the point at which they fuse, i.e. the smallest internal node which contains them 
both. In addition, we record the COPHENETIC DISTANCE: this is the distance from 
one subnode to another at the point in the algorithm at which the subnodes fused. 

Fig. 1. An Example Dendrogram (leaves: Bockelwitz, Schmannewitz, Borstendorf, Gornsdorf, 
Wehrsdorf, Cursdorf). Note the cophenetic distance is reflected in the horizontal 
distance from the leaves to the encompassing node. Thus the cophenetic distance between 
Borstendorf and Gornsdorf is a bit more than 0.04. 

Note that the algorithms depend on identifying minimal elements, which leads to 
instability: small changes in the input data can lead to very different groups being 
identified (Jain et al., 1999). Nor is this problem merely “theoretical”. Figure 2 shows 
two very different cluster results which derive from genuine, extremely similar data (the 
distance matrices correlated at r = 0.97). 




Fig. 2. Two Bulgarian Datasets from Osenova et al. (to appear). Although the distance matrices 
correlated nearly perfectly (r = 0.97), the results of WPGMA clustering differ substantially. 
Bootstrapping and noisy clustering resolve this instability. 



Finally, we note that the distances we shall cluster do not satisfy the ultrametric 
axiom: $\forall s_1, s_2, s_3\; D(s_1, s_2) \le \max\{D(s_2, s_3), D(s_1, s_3)\}$ (Page and Holmes (2006:26)). 
Phylogeneticists interpret data satisfying this axiom temporally, i.e., they interpret 
data points clustered together as later branches in an evolutionary tree. The dialectal 
data undoubtedly reflects historical developments to some extent, but we proceed 
from the premise that the social function of dialect variation is to signal geographic 
provenance, and that similar linguistic variants signal similar provenance. If the signal 
is subject to change due to contact or migration, as it undoubtedly is, then similarity 
could also result from recent events. This muddies the history, but does not 
change the socio-geographic interpretation. 

2.1 Data 

In the remainder of the paper we use the data analyzed by Nerbonne and Siedle 
(2005) consisting of 201 word pronunciations recorded and transcribed at 186 sites 
throughout all of contemporary Germany. The data was collected and transcribed 
by researchers at Marburg between 1976 and 1991. It was digitized and analyzed 
in 2003-2004. The distance between word pronunciations was measured using a 
modified version of edit distance, and full details (including the data) are available. 
See Nerbonne and Siedle (2005). 



3 Bootstrapping clustering 

The biological literature recommends the use of bootstrapping in order to obtain 
stable clustering results (Felsenstein, 2004: Chap. 20). Mucha and Haimerl (2005) 
and Manni et al. (2006) likewise recommend bootstrapping for the interpretation of 
clustering applied to dialectometric data. 

In bootstrapped clustering we resample the data, using replacement. In our case 
we resample the set of word-pronunciation distances. As noted above, each linguistic 
observation o is associated with a site × site matrix $M_o$. In the observation matrix, 
each cell represents the linguistic distance between two sites with respect to the 
observation: $M_o(s, s') = D(o_s, o_{s'})$. In bootstrapping, we assign a weight to each matrix 
(observation) identical to the number of times it is chosen in resampling: 

\[ w_o = \begin{cases} n & \text{if observation } o \text{ is drawn } n \text{ times} \\ 0 & \text{otherwise} \end{cases} \]

If we resample I times, then $I = \sum_o w_o$. The result is a subset of the original set of 
observations (words), where some of the observations may be weighted as a result 
of the resampling. Each resampled set of words yields a new distance matrix $M_i$, 
namely the average distances of the sites using the weighted set of words obtained 
via bootstrapping. 
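
One bootstrap iteration can be sketched as follows. The function names, the seed and the toy matrices are hypothetical, and the number of draws is left as a parameter because the text does not fix it; drawing as many times as there are observations is the usual convention.

```python
import random
from collections import Counter


def bootstrap_weights(n_observations, n_draws, rng):
    """Draw observation indices with replacement and return the weights w_o."""
    draws = [rng.randrange(n_observations) for _ in range(n_draws)]
    counts = Counter(draws)
    return [counts.get(o, 0) for o in range(n_observations)]


def weighted_mean_matrix(matrices, weights):
    """Average the per-observation site x site matrices using the weights."""
    total = sum(weights)
    n = len(matrices[0])
    out = [[0.0] * n for _ in range(n)]
    for matrix, w in zip(matrices, weights):
        if w == 0:
            continue                  # observation not drawn in this sample
        for s in range(n):
            for t in range(n):
                out[s][t] += w * matrix[s][t]
    return [[value / total for value in row] for row in out]


# Hypothetical data: two observations (words) measured at two sites.
word_matrices = [
    [[0.0, 2.0], [2.0, 0.0]],
    [[0.0, 4.0], [4.0, 0.0]],
]
rng = random.Random(0)                # fixed seed only for repeatability
weights = bootstrap_weights(len(word_matrices), len(word_matrices), rng)
resampled = weighted_mean_matrix(word_matrices, weights)
```

Repeating these two steps yields the sequence of resampled matrices that are then clustered one by one.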

We apply clustering to each $M_i$ obtained via bootstrapping, recording for each 
group of sites encountered in the dendrogram (each set of leaves below some node) 
both that the group was encountered, and the cophenetic distance of the group (at 
the point of fusion). This sounds as if it could lead to a combinatorial problem, but 
fortunately most of the $2^{186}$ possible groups are never encountered. 

In a final step we extract a COMPOSITE DENDROGRAM from this collection, 
consisting of all of the groups that appear in a majority of the clustering iterations, 
together with their cophenetic distance. See Fig. 3 for an example. 
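
The extraction of the composite dendrogram amounts to bookkeeping over the repeated clusterings. In the sketch below each clustering is assumed to be given as a list of (group, cophenetic distance) pairs; the function name and the toy runs are hypothetical, and averaging the distance over the runs in which a group occurred is an assumption about how the consensus distance is formed.

```python
from collections import defaultdict


def composite_groups(runs):
    """Keep the groups found in a majority of runs.

    runs: list of clusterings, each a list of (frozenset_of_sites, distance).
    Returns {group: (share_of_runs, mean_cophenetic_distance)}.
    """
    seen = defaultdict(list)
    for run in runs:
        for group, distance in run:
            seen[group].append(distance)
    n_runs = len(runs)
    return {
        group: (len(ds) / n_runs, sum(ds) / len(ds))
        for group, ds in seen.items()
        if len(ds) > n_runs / 2       # strict majority of the iterations
    }


# Hypothetical results of three clustering iterations.
ab, abc, cd = frozenset("AB"), frozenset("ABC"), frozenset("CD")
runs = [
    [(ab, 1.0), (abc, 4.0)],
    [(ab, 2.0), (abc, 6.0)],
    [(ab, 3.0), (cd, 2.5)],
]
composite = composite_groups(runs)
```

Only groups actually encountered are stored, which is why the large space of possible groups poses no practical problem.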



4 Clustering with noise 

Clustering with noise is also motivated by the wish to prevent the sort of instability 
illustrated in Fig. 2. To cluster with noise we assume a single distance matrix, from 
which it turns out to be convenient to calculate variance (among all the distances). 
We then specify a small noise ceiling c, e.g. c = σ/2, i.e. one-half standard deviation 
of distances in the matrix. We then repeat 100 times or more: add random amounts 
of noise r to the matrix (i.e., different amounts to each cell), allowing r to vary 
uniformly, 0 ≤ r ≤ c. 
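
A single noise iteration might look as follows. Several details are assumptions made for this sketch and are not specified in the text: the standard deviation is the population value over the off-diagonal cells, the same noise is added to both symmetric cells so that the matrix stays symmetric, and the diagonal is left at zero. Function names, seed and toy matrix are hypothetical.

```python
import math
import random


def noise_ceiling(dist, fraction=0.5):
    """Noise ceiling c as a fraction of the standard deviation of the distances."""
    n = len(dist)
    values = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return fraction * math.sqrt(variance)


def add_noise(dist, c, rng):
    """Return a copy of dist with uniform noise r in [0, c] added per site pair."""
    n = len(dist)
    noisy = [row[:] for row in dist]
    for i in range(n):
        for j in range(i + 1, n):
            r = rng.uniform(0.0, c)
            noisy[i][j] = noisy[j][i] = dist[i][j] + r
    return noisy


# Hypothetical three-site distance matrix.
toy = [[0.0, 1.0, 3.0],
       [1.0, 0.0, 5.0],
       [3.0, 5.0, 0.0]]
c = noise_ceiling(toy)                # half a standard deviation
rng = random.Random(1)                # fixed seed only for repeatability
noisy = add_noise(toy, c, rng)
```

Calling `add_noise` 100 times or more and clustering each result gives the repeated clusterings from which groups and their cophenetic distances are collected.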



Fig. 3. A Composite Dendrogram where labels indicate how often a group of sites was 
clustered and the (horizontal) length of the brackets reflects mean cophenetic distance. 



If we let Mi stand in this case for the matrix obtained by adding noise (in the i-th 
iteration), then the rest of the procedure is identical to bootstrapping. We apply clus- 
tering to Mi and record the groups clustered together with their cophenetic distances, 
just as in Fig. 3. 

5 Projecting to geography 

Since dialectology studies the geographic variation of language, it is particularly 
important to be able to examine the results of analyses as these correspond to geog- 
raphy. 

In order to project the results of either bootstrapping or noisy clustering to the 
geographic map, we use the customary Voronoi tessellation (Goebl (1984)), in which 
each site is embedded in a polygon which separates it from other sites optimally. In 
this sort of tiling there is exactly one border running between each pair of adjacent 
sites, and bisecting the imaginary line linking the two. To project mean cophenetic 
distance matrices onto the map we simply draw the Voronoi tessellation in such a 
way that the darkness of each line corresponds to the distance between the two sites 
it separates. See Fig. 4 for examples of maps obtained by bootstrapping two different 
clustering algorithms. These largely corroborate scholarship on German dialectology 
(König 1991:230-231). 

Unlike dialect area maps these COMPOSITE CLUSTER MAPS reflect the variable 
strength of borders, represented by the border’s darkness, reflecting the consensus 
cophenetic distance between the adjacent sites. 

Haag (1898) (discussed by Schiltz (1996)) proposed a quantitative technique in 
which the darkness of a border reflected the number of differences counted in 
a given sample, and similar maps have been in use since. Such maps look similar to 
the maps we present here, but note that the borders we sketch need not be reflected in 
local differences between the two sites. The clustering can detect borders even where 
differences are gradual, when borders emerge only when many sites are compared.* 



6 Results 

Bootstrapping clustering and “noisy” clustering identify the same groups in the 186- 
site German sample examined here. This is shown by the nearly perfect correlation 
between the mean cophenetic distances assigned by the two techniques (r = 0.997). 
Given the general acceptance of bootstrapping as a means of examining the stability 
of clusters, this result shows that “noisy” clustering is as effective. 

The usefulness of the composite cluster map may best be appreciated by inspect- 
ing the maps in Fig. 4. While maps projected from simple clustering (see Fig. 2) 
merely partition an area into non-overlapping subareas, these composite maps re- 
flect a great deal more of the detailed structure in the data. The map on the left was 
obtained by bootstrapping using WPGMA. 

Although both bootstrapping and adding noise identify stable groups, neither 
removes the bias of the particular clustering algorithm. Fig. 4 compares the bootstrapped 
results of WPGMA clustering with unweighted clustering (UPGMA, see 
Jain et al. (1999)). In both cases bootstrapping and noisy clustering correlate nearly 
perfectly, but it is clear that WPGMA is sensitive to more structure in the data. 
For example, it distinguishes Bavaria (in southeastern Germany) from the Southwest 
(Swabia and Alemannia). So the question of the optimal clustering method for dialectal 
data remains. For further discussion see 
http://www.let.rug.nl/kleiweg/kaarten/MDS-clusters.html. 



7 Discussion 

The “noisy” clustering examined here requires that one specify a parameter, the noise 
ceiling, and, naturally, one prefers to avoid techniques involving extra parameters. 
On the other hand it is applicable to single matrices, unlike bootstrapping, which 
requires that one be able to identify components to be selected in resampling. Both 
techniques require that one specify a number of iterations, but this is a parameter of 
convenience. Small numbers of iterations are convenient, and large values result in 
very stable groupings. 



* Fischer (1980) discusses adding a contiguity constraint to clustering, which structures the 
hypothesis space in a way that favors clusterings of contiguous regions. Since we use the 
projection to geography to spot linguistic anomalies — dialect islands, but also field worker 
and transcriber errors — we do not wish to push the clustering in a direction that would hide 
these anomalies. 

Fig. 4. Two Composite Cluster Maps, on the left one obtained by bootstrapping using weighted 
group average clustering, and on the right one obtained by unweighted group average. We do 
not show the maps obtained using “noisy” clustering, as these are indistinguishable from the 
maps obtained via bootstrapping. The composite distance matrices correlate nearly perfectly 
(r = 0.997) when comparing bootstrapping and “noisy” clustering. 

Abstract. The categorization of natural language texts is a well established research field in 
computational and quantitative linguistics (Joachims 2002). In the majority of cases, the vector 
space model is used in terms of a bag of words approach. That is, lexical features are extracted 
from input texts in order to train some categorization model and, thus, to attribute, for 
example, authorship or topic categories. Parallel to these approaches there has been some effort 
to perform text categorization in terms of structural rather than lexical features of documents. 
More specifically, quantitative text characteristics have been computed in order to 
derive a sort of structural text signature which nevertheless allows reliable text categorizations 
(Kelih & Grzybek 2005; Pieper 1975). This “bag of features” approach regains attention when 
it comes to categorizing websites and other document types whose structure is far from 
the simplicity of tree-like structures. Here we present a novel approach to structural classifiers 
which systematically computes structural signatures of documents. In summary, we present 
a text categorization algorithm which, in the absence of any lexical features, nevertheless 
performs a remarkably good classification even if the classes are thematically defined. 



1 Introduction 

An alternative way to categorize documents, apart from the well established “bag of 
words” approach, is to categorize by means of structural features. This approach works 
in the absence of any lexical information, utilizing quantitative characteristics of 
documents computed from the logical document structure.¹ That means that markers 
like content words are completely disregarded. Features like distributions of sections, 
paragraphs, sentence length etc. are considered instead. 

Capturing structural properties to build a classifier assumes that given category 
separations are reflected by structural differences. According to Biber (1995) we can 
expect that functional differences correlate with structural and formal representations 
of text types. This may explain good overall results in terms of F-Measure.² 



¹ See also Mehler et al. (2006). 

² The harmonic mean of precision and recall is used here to measure the overall success of 
the classification. 

However, the F-Measure gives no information about the quality of the investigated 
categories. That is, it provides no a priori knowledge about the suitability of the categories 
for representing homogeneous classes and for applying them in machine learning tasks. 
Since natural language categories, e.g. in the form of web documents or other 
textual units, do not necessarily come with a well defined structural representation, 
it is important to know how the classifier behaves when dealing with such categories. 

Here, we investigate a large number of existing categories, thematic classes or 
rubrics taken from a 10-year newspaper corpus of Süddeutsche Zeitung (SZ 2004), 
where a rubric represents a recurrent part of the newspaper like ‘sports’ or 
‘tv-news’. We systematically test their goodness in a structural classifier framework, 
asking more specifically for a maximal subset of all rubrics which gives an F-Measure 
above a predefined cut-off c ∈ [0, 1] (e.g. c = 0.9). We evaluate the classifier in a 
way that allows us to exclude possible drawbacks with respect to: 

• the categorization model used (here SVM³ and Cluster Analysis),⁴ 

• the text representation model used (here the bag of features approach) and 

• the structural homogeneity of categories used. 

The first point relates to distinguishing supervised and unsupervised learning. That 
is, we perform both sorts of learning although we do not systematically evaluate 
them comparatively with respect to all possible parameters. Rather, we investigate 
the potential of our features by evaluating them with respect to both scenarios. The 
representation format (vector representation) is restricted by the model used (e.g. 
SVM). Thus, we concentrate on the third point and apply an iterative categorization 
procedure (ICP)⁵ to explore the structural suitability of categories. In summary, our 
experiments have two goals: 

1. to study given categories using the ICP in order to filter out structurally 
inconsistent types and 

2. to make judgements about the structural classifier’s behavior when dealing with 
categories of different size and quality levels. 



2 Category selection 

The 10-year corpus of the SZ used in the present study contains 95 different rubrics. 
The frequency distribution of these rubrics is extremely unequal over the 
whole set (see Figure 1). In order to minimize the calculation effort we reduce the 
initial set of 95 rubrics to a smaller subset according to the following criteria. 

1. First, we compute the mean μ and the standard deviation σ for the whole set. 



³ Support Vector Machines. 

⁴ Supervised vs. unsupervised, respectively. 

⁵ See Sec. 4. 

Fig. 1. Categories/Articles-Distribution of 95 Rubrics of SZ. 



2. Second, we pick out all rubrics R whose cardinality |R| (the number of examples 
within the corpus) lies in the interval: 

μ − σ/2 < |R| < μ + σ/2 

This selection method makes it possible to specify a window around the mean value of all 
documents, leaving out the unusual cases.⁶ In this way, the resulting subset of 68 categories is 
selected. 
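
The selection window can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function name `select_rubrics`, the toy rubric sizes, and the choice of the population standard deviation are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def select_rubrics(sizes):
    """Keep the rubrics whose size lies strictly inside mu +/- sigma/2.

    sizes: dict mapping a rubric name to its number of articles.
    Assumption: sigma is the population standard deviation of the sizes.
    """
    mu = mean(sizes.values())
    sigma = pstdev(sizes.values())
    low, high = mu - sigma / 2, mu + sigma / 2
    return {rubric: n for rubric, n in sizes.items() if low < n < high}


if __name__ == "__main__":
    # Hypothetical rubric sizes, not taken from the SZ corpus.
    toy_sizes = {"a": 10, "b": 50, "c": 60, "d": 400}
    print(select_rubrics(toy_sizes))
```

Very small and very large rubrics fall outside the window, which is the intended effect of leaving out the unusual cases.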



3 The evaluation procedure 

The data representation format for the subset of rubrics uses a vector representation 
(bag of features approach) where each document is represented by a feature vector.⁷ 
The vectors are calculated as structural signatures of the underlying documents. To 
avoid drawbacks (see Sec. 1) caused by the evaluation method in use, we compare 
three different categorization scenarios: 

1. Supervised scenario by means of SVM-light⁸, 

2. Unsupervised scenario in terms of Cluster Analysis and 

3. Finally, a baseline experiment based on random clustering. 

⁶ The method is taken from Bock (1974). Rieger (1989) uses it to identify above-average 
agglomeration steps in the clustering framework. Gleim et al. (2007) successfully applied 
the method to develop quality filters for wiki articles. 

⁷ See Mehler et al. (2007) for a formalization of this approach. 

⁸ Joachims (2002). 

Consider an input corpus K and a set of categories C with the number of categories 
|C| = n. Then we proceed as follows to evaluate our various learning scenarios: 

• For the supervised case we train a binary classifier by treating the negative 
examples of a category C_i ∈ C as K \ [C_i] and the positive examples as a subset 
[C_i] ⊂ K. The subsets [C_i] are in this experiment pairwise disjoint and we define 
𝕃 = {[C_i] | C_i ∈ C} as a partition of positive and negative examples of C_i. 
Classification results are obtained in terms of precision and recall. We calculate 
the F-score for a class C_i in the following way: 

$$F_i = \frac{2}{\frac{1}{\mathrm{recall}_i} + \frac{1}{\mathrm{precision}_i}}$$

In the next step we compute the weighted mean for all categories of the partition 
𝕃 in order to judge the overall separability of given text types using the 
F-Measure: 

$$\text{F-Measure}(\mathbb{L}) = \sum_{i=1}^{n} \frac{|C_i|}{|\mathbb{L}|} F_i$$

• In the case of unsupervised experiments we proceed as follows: The unsupervised 
procedure evaluates different variants of Cluster Analysis (hierarchical, 
k-means), trying out several linkage possibilities (complete, single, average, 
weighted) in order to achieve the best performance. As in the supervised 
case, the best clustering results are presented in terms of F-Measure values. 

• Finally, the random baseline is calculated by preserving the original category 
sizes and by mapping articles randomly to them. Results of random clustering 
help to check the success of both learning scenarios. Thus, clusterings close to 
the random baseline indicate either a failure of the cluster algorithm or that the 
text types cannot be separated well by structure. 
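
The three evaluation ingredients described above (per-class F-score, weighted mean over classes, random baseline) can be sketched in Python as follows. The function names are hypothetical, and weighting each class by its share of all documents is an assumption about how the weighted mean is normalized.

```python
import random


def f_score(precision, recall):
    """Harmonic mean of precision and recall for one class."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / recall + 1 / precision)


def weighted_f_measure(scores, sizes):
    """Mean of per-class F-scores, weighted by class size (assumed weighting)."""
    total = sum(sizes)
    return sum(n / total * f for f, n in zip(scores, sizes))


def random_baseline(labels, seed=0):
    """Randomly reassign articles to categories while preserving category sizes.

    Shuffling the list of true labels keeps the number of articles per
    category unchanged but destroys any relation to document structure.
    """
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    # Hypothetical per-class results: (precision, recall, class size).
    results = [(1.0, 1.0, 30), (0.5, 0.5, 10)]
    scores = [f_score(p, r) for p, r, _ in results]
    print(weighted_f_measure(scores, [n for _, _, n in results]))
```

The baseline labels can then be scored against the true labels in the same way as a real clustering, which gives the reference value against which both learning scenarios are compared.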

In summary, we check the performance of structural signatures within two learning 
scenarios - supervised and unsupervised - and compare the results with the random 
clustering baseline. The next section describes the iterative categorization procedure 
(ICP) used to investigate the structural homogeneity of categories. 



4 Exploring the structural homogeneity of text types by means of 
the Iterative Categorisation Procedure (ICP) 

In this Section we return to the question mentioned at the beginning. Given a cut-off 
c ∈ [0, 1] (e.g. c = 0.9) we ask for the maximal subset of rubrics that 
achieves an F-Measure value F > c. Decreasing the cut-off c successively we get a 
rank ordering of rubrics ranging from the best contributors to the worst ones. The 
ICP makes it possible to determine a result set of maximal size n with the maximal internal 
homogeneity compared to all candidate sets in question. Starting with a given set A of 
input categories to be learned we proceed as follows: 

1. Start: Select a seed category C ∈ A and set A_1 = {C}. The rank r of C equals 
r(C) = 1. Now repeat: 

2. Iteration (i > 1): Let B = A \ A_{i−1}. Select the category C ∈ B which, when added 
to A_{i−1}, maximizes the F-Measure value among all candidate extensions of A_{i−1} 
by means of single categories of B. Set A_i = A_{i−1} ∪ {C} and r(C) = i. 

3. Break off: The iteration algorithm terminates if either 

i) A \ A_i = ∅ or 

ii) the F-Measure value of A_i is smaller than a predefined cut-off or 

iii) the F-Measure value of A_i is smaller than the one of the operative baseline. 

If none of these stop conditions holds repeat step (2). 
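
The three steps can be sketched as a greedy forward selection. The following Python sketch is only an illustration under stated assumptions: `f_measure` and `baseline` are hypothetical callables supplied by the user (for example a wrapper that trains and evaluates a classifier on the given list of categories), and ties between candidates are broken by input order.

```python
def icp(categories, seed, f_measure, cutoff, baseline=None):
    """Greedy sketch of the iterative categorization procedure.

    categories: all input categories (the set A).
    seed:       the seed category C with rank 1.
    f_measure:  callable mapping a list of categories to an F-Measure value.
    cutoff:     predefined cut-off c.
    baseline:   optional callable giving the baseline F-Measure for a list
                of categories (stop condition iii).
    Returns the selected categories in rank order and the rank of each one.
    """
    selected = [seed]
    rank = {seed: 1}
    remaining = [c for c in categories if c != seed]
    while remaining:  # stop condition i): nothing left to add
        # Step 2: pick the single category whose addition scores best.
        best = max(remaining, key=lambda c: f_measure(selected + [c]))
        selected.append(best)
        remaining.remove(best)
        rank[best] = len(selected)
        score = f_measure(selected)
        if score < cutoff:  # stop condition ii)
            break
        if baseline is not None and score < baseline(selected):  # iii)
            break
    return selected, rank


if __name__ == "__main__":
    # Toy stand-in for a real evaluation: the weakest category decides.
    quality = {"sports": 1.0, "tv": 0.95, "politics": 0.92, "misc": 0.5}
    toy_f = lambda cats: min(quality[c] for c in cats)
    print(icp(list(quality), "sports", toy_f, cutoff=0.9))
```

As in the description above, a category is first added and the stop conditions are checked afterwards, so the last ranked category may be the one that pushes the F-Measure below the cut-off.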

The kind of ranking described here is more informative than the F-Measure value 
alone. That is, the F-Measure gives global information about the overall separability 
of categories. The ICP, in contrast, provides additional local information about 
the weights of single categories with respect to the overall performance. This 
information makes it possible to check the suitability of single categories to serve as 
structural prototypes. Knowledge about the homogeneity of each category provides a deeper 
insight into the possibilities of our approach. 

In the next Section we present the rankings of the ICP applied to supervised and 
unsupervised learning and compare them with the random clustering baseline. In 
order to exclude a dependence of the structural approach on one of the learning 
methods, we also apply the best-of-unsupervised-ranking to the supervised scenario and 
compare the outcomes. That means that, for SVM learning, we use exactly the same 
ranking that performed best in the unsupervised experiment. 



5 Results 

Table 1 gives an overview of the categories used. Of the total of 95 rubrics, 
68 were selected using the selection method described in Section 2; 55 were 
considered in the unsupervised and 16 in the supervised experiments. The common 
subset used in both cases consists of 14 categories. 

The Y-axis of Figure 2 represents the F-Measure values and the X-axis the rank 
order of categories iteratively added to the seed set. The supervised scenario (upper 
curve) performs best ranging around the value of 1.0. The values of the unsupervised 
case decrease more rapidly (the third curve from above). The unsupervised best- 
of-ranking categorized with the supervised method (second curve from above) lies 
between the best results of the two methods. The lower curve represents the results 
of random clustering. 



6 Discussion 

Figure 2 shows that all F-Measure results lie high above the 
baseline of random clustering. All the subsets are well separated by their document 
structure, which indicates a potential of structure-based categorizations. The point 
here was to observe the decrease of the F-Measure value while adding new 
categories. 

Table 1. Corpus Formation (by Categories). 

Category Set                 Number 
Total                        95 
Selected Initial Set         68 
Unsupervised                 55 
Supervised                   16 
Unsupervised ∩ Supervised    14 

Fig. 2. The F-Measure Results of All Experiments 

The supervised method shows the best results, remaining stable with a growing 
number of additional categories. The unsupervised method shows a more rapid 
decrease but is less time consuming. Cluster Analysis succeeds in ranking 55 rubrics 
whereas SVM-light ranks only 16 within the same time span. 

In order to compare the performance of both methods (supervised vs. unsupervised) 
more precisely we ran the supervised categorization based on the best-of-ranking 
of the unsupervised case. The resulting curve remains stable longer than the 
unsupervised one. Since the order and the features of categories are equal, the 
resulting difference indicates an overall better accuracy of SVM compared to Cluster 
Analysis. 

One assumption for the success of the structural classifier was that the performance 
may depend on the article size, that is, on the representativeness of a category. 
To account for this, we compared the category size of the best-of-rankings of both 
experiments. Figure 3 shows a high variability in size, which indicates that the size 
factor does not influence the classifier. 

Fig. 3. Categories/Articles-Distribution of Sets used in Supervised/Unsupervised Experiments. 



7 Conclusion 

In this paper we presented experiments which shed light on the possibilities of a 
classifier operating with structural signatures of text types. More specifically, we 
investigated the ability of the classifier to deal with a large number of natural language 
categories of different size and quality. The best-of-rankings showed that different 
evaluation methods (supervised/unsupervised) prefer different combinations of 
categories to achieve the best separation. Furthermore, we could see that the overall 
difference in performance of the two methods depends on the method used rather than 
on the combination of categories. 

Another interesting finding is that the structural classifier seems not to depend on 
category size, allowing a good categorization of small, less representative categories. 
This motivates the use of logical document structure (or any other kind of structure) for 
machine learning tasks and the extension of the framework to more demanding tasks 
such as dealing with web documents. 

Abstract. Scenario techniques have become popular tools for dealing with possible futures. 
Driving forces of the development (the so-called key factors) and their possible projections 
into the future are determined. After a reduction of the possible combinations of projections to 
a set of consistent and probable candidates for possible futures, traditionally one-mode cluster 
analysis is used for grouping them. In this paper, two-mode clustering approaches are proposed 
for this purpose and tested in an application for the future of eLearning in higher education. 
In this application area, scenario techniques are a very young and promising methodology. 



1 Introduction: Scenario analysis 

Since their first applications for business prognostication (e.g., Kahn, Wiener (1967), 
Meadows et al. (1972), Schwartz (1991)), scenario techniques have become popular 
tools for governmental and corporate planners in order to deal with possible futures 
(“scenarios”) and to support decisions in the face of uncertainty. Nowadays, in many 
research areas scenario analysis is an attractive tool with a huge variety of applications 
(e.g., Götze (1993), Mißler-Behr (2002), Welfens et al. (2004), van der Heijden 
(2005), Pasternack (2006), Ringland (2006)). However, for higher education, the 
application of scenario analysis is new (e.g., Sprey (2003)). Different methodological 
approaches have been proposed, most of them using (roughly) four stages (e.g., 
Coates (2000), Phelps et al. (2001)): 

• In a first stage, the scope of the scenario analysis has to be defined including 
the focal issues (e.g. influence areas) and the driving forces for them (social, 
economic, political, environmental, technological factors). After a reduction of 
these driving forces with respect to relevance, importance, and inter-connection, 
a list of so-called key factors results (e.g., A, B, C). 

• Then, in the second stage, alternative projections (possible levels) for these key 
factors (e.g., A1, A2, A3, B1, B2) have to be determined. By combining these 
projections, a database of candidates for possible futures (e.g., (A1,B1,C1,...), 
(A1,B2,C1,...)) is available. Additionally, the consistency for pairs of projections 
(e.g., (A1,B1), (A1,B2)) and the probability/realism of single projections within 
the time span under research have to be rated. 

• Then, in a third stage, the candidates in the database have to be evaluated on the basis 
of their projections’ pairwise consistency and probability. Using rankings and/or 
cut-off values or similar approaches, the database is reduced to a set of consistent 
and probable candidates. Finally, the reduced set of candidates (the so-called 
first mode), described by their projections w.r.t. the key factors (the so-called 
second mode), is grouped via cluster analysis into a small number of candidate 
groups, the so-called “scenarios”. In an unrelated second step these candidate 
groups have to be analyzed to find out which projections best characterize them. 
Recently, new fuzzy clustering approaches have been proposed for dealing with 
this identification problem (see e.g. Mißler-Behr (1993), (2002)). 

• Finally, in a fourth stage, strategic options for dealing with the selected possible 
futures (“scenarios”) have to be developed. 

In this paper we develop new two-mode clustering approaches for simultaneously 
grouping candidates and projections in the third stage. The new approach is based on 
the two-mode additive clustering procedure of Baier et al. (1997) for simultaneous market 
segmentation and structuring with overlapping and non-overlapping cases. 



2 Two-Mode clustering (for scenario evaluation) 

2.1 The model 

As in Baier et al. (1997), the following notation is used (see Krolak-Schwerdt, 
Wiedenbeck (2006) for a recent comparison of similar additive clustering approaches): 
i = 1, ..., I is an index for first mode objects (e.g., preselected consistent and probable 
candidates (A1,B1,C1,...) or (A1,B2,C1,...) from stage two). j = 1, ..., J is an index 
for second mode objects (e.g., projections A1, A2, A3, ...). k = 1, ..., K is an index for 
first mode clusters (clusters of candidates) and l = 1, ..., L an index for second mode 
clusters (clusters of projections). S = (s_ij)_{I×J} is a matrix of (observed) associations 
between first and second mode objects (s_ij ∈ ℝ ∀ i, j). With association values of 1 
- if the projection is part of the candidate - or 0 - if the projection is not part of the 
candidate -, S is a binary data matrix (see, e.g., Li (2005) for an analysis of binary 
data using two-mode clustering). 

Model parameters are the following: P = (p_ik)_{I×K} is a binary matrix describing 
first mode cluster membership with p_ik = 1 if first mode object i belongs to first mode 
cluster k and p_ik = 0 otherwise. Q = (q_jl)_{J×L} is a binary matrix describing second mode 
cluster membership with q_jl = 1 if second mode object j belongs to second mode 
cluster l and q_jl = 0 otherwise. W = (w_kl)_{K×L} is a matrix of weights (w_kl ∈ ℝ ∀ k, l). 

In order to provide results where candidates are members of one and only one 
scenario whereas projections are allowed to be members of none, one, or more than 
one scenario, additional assumptions are necessary: The first mode membership 
matrix P is restricted to be non-overlapping (i.e. Σ_{k=1}^{K} p_ik = 1 ∀ i) whereas for the 
second mode membership matrix Q no such restrictions hold. Q is allowed to be 
overlapping. 

2.2 Parameter estimation 

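The parameters are determined in order to minimize the objective function 

$$Z = \sum_{i=1}^{I}\sum_{j=1}^{J} (s_{ij} - \hat{s}_{ij})^2 \quad\text{with}\quad \hat{s}_{ij} = \sum_{k=1}^{K}\sum_{l=1}^{L} p_{ik} w_{kl} q_{jl} \quad \forall i,j, \tag{1}$$

or, equivalently, to maximize the variance accounted for 

$$\mathrm{VAF} = 1 - Z \Big/ \sum_{i=1}^{I}\sum_{j=1}^{J} (s_{ij} - \bar{s})^2 \quad\text{with}\quad \bar{s} = \sum_{i=1}^{I}\sum_{j=1}^{J} s_{ij}/(IJ) \tag{2}$$

on the basis of the underlying model S = PWQ′ + error. 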
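
A minimal Python sketch may make the two criteria concrete. It assumes plain nested lists for S, P, W and Q (with Q stored as J×L, as in the notation above); the function names are hypothetical and no attempt is made at efficiency.

```python
def model_matrix(P, W, Q):
    """Fitted associations s_hat[i][j] = sum_k sum_l p_ik * w_kl * q_jl."""
    K, L = len(W), len(W[0])
    return [
        [sum(P[i][k] * W[k][l] * Q[j][l] for k in range(K) for l in range(L))
         for j in range(len(Q))]
        for i in range(len(P))
    ]


def loss_and_vaf(S, P, W, Q):
    """Return the least squares loss Z and the variance accounted for (VAF)."""
    S_hat = model_matrix(P, W, Q)
    cells = [(s, h) for row_s, row_h in zip(S, S_hat) for s, h in zip(row_s, row_h)]
    Z = sum((s - h) ** 2 for s, h in cells)
    grand_mean = sum(s for s, _ in cells) / len(cells)
    total = sum((s - grand_mean) ** 2 for s, _ in cells)
    return Z, 1.0 - Z / total


if __name__ == "__main__":
    # Toy example: three candidates, three projections, two linked clusters.
    S = [[1, 1, 1], [1, 1, 0], [0, 0, 1]]
    P = [[1, 0], [1, 0], [0, 1]]
    W = [[1, 0], [0, 1]]
    Q = [[1, 0], [1, 0], [0, 1]]
    print(loss_and_vaf(S, P, W, Q))
```

Since the denominator in (2) does not depend on the parameters, minimizing Z and maximizing VAF lead to the same solution.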

In our approach, an alternating least squares procedure is applied. The different sets 
of model parameters (P, W, and Q) are initialized and alternatingly improved w.r.t. 
Z. Alternatively, a Bayesian model formulation could be used (see DeSarbo et al. 
(2005) in a market structuring setting). However, for our approach, we first discuss 
the iterative steps for obtaining improved estimates for selected model parameters 
when estimates for the remaining sets of model parameters are given. Finally, the 
complete procedure is presented. 

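a) Estimation of P for given W and Q: Set 

$$\hat{p}_{ik} = \begin{cases} 1 & \text{if } \sum_{j=1}^{J}\bigl(s_{ij} - \sum_{l=1}^{L} w_{kl} q_{jl}\bigr)^2 = \min_{1 \le k' \le K} \sum_{j=1}^{J}\bigl(s_{ij} - \sum_{l=1}^{L} w_{k'l} q_{jl}\bigr)^2, \\ 0 & \text{otherwise} \end{cases} \quad \forall i,k. \tag{3}$$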


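b) Estimation of Q and W for given P: Using (for l = 1, ..., L selected) 

$$Z = \sum_{i=1}^{I}\sum_{j=1}^{J}\Bigl(\underbrace{s_{ij} - \sum_{k=1}^{K}\sum_{l' \ne l} p_{ik} w_{kl'} q_{jl'}}_{=:\, s_{ijl}} - \sum_{k=1}^{K} p_{ik} w_{kl} q_{jl}\Bigr)^2 \tag{4}$$

(s_ijl is constant w.r.t. q_1l, ..., q_Jl, w_1l, ..., w_Kl), estimates of Q and W can be obtained 
by starting from initial values and alternatingly improving the parameter estimates 
for second mode cluster l = 1, ..., L via 

$$\hat{q}_{jl} = \begin{cases} 1 & \text{if } \sum_{i=1}^{I}\bigl(s_{ijl} - \sum_{k=1}^{K} p_{ik} w_{kl}\bigr)^2 < \sum_{i=1}^{I} (s_{ijl})^2, \\ 0 & \text{otherwise} \end{cases} \quad \forall j \tag{5}$$

and minimizing 

$$Z = \sum_{i=1}^{I}\sum_{j=1}^{J}\Bigl(s_{ijl} - \sum_{k=1}^{K} p_{ik} w_{kl} q_{jl}\Bigr)^2 \quad \text{via OLS w.r.t. } \{w_{1l}, \ldots, w_{Kl}\} \tag{6}$$

(OLS = ordinary least squares regression). 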

Thus, our estimation procedure can be described as follows: 

1. Determine initial estimates of P, W, and Q. Compute Z. 

2. Repeat 

Improve the estimates of P using a). 

Improve the estimates of Q and W using b). 

Until Z cannot be improved any more. 

For applying the above model and algorithms for scenario evaluation, additionally, 
the first and second mode clusters can be linked by setting K=L and restricting W 
to an identity matrix. This can be achieved by initialization and by omitting the cor- 
responding algorithmic steps where W is updated. In the following section, this ap- 
proach (with K=L and W restricted to an identity matrix) is applied in stage three of 
a scenario analysis in higher education. 
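
For the linked case (K = L, W fixed to the identity matrix, P non-overlapping) the alternating steps become very simple, and a small Python sketch can illustrate them. This is a reading of steps a) and b) under these restrictions for binary S, not the original implementation: with W = I, step a) assigns each candidate to the cluster whose projection profile is closest in squared error, and the comparison in (5) reduces to checking whether predicting 1 fits the members of a cluster better than predicting 0. The function name, the start from an initial assignment, and the K×J storage of the profiles are assumptions.

```python
def fit_linked_two_mode(S, assignment, max_iter=100):
    """Alternating sketch for K = L and W = identity on a binary matrix S.

    S:          list of binary rows (candidates x projections).
    assignment: initial cluster index (0..K-1) for every row of S.
    Returns the final assignment and the cluster profiles, stored as K rows
    of length J (the transpose of Q in the notation of the text).
    """
    n_rows, n_cols = len(S), len(S[0])
    K = max(assignment) + 1
    assignment = list(assignment)
    profiles = []
    for _ in range(max_iter):
        # Step b) with W = I: projection j joins cluster k if predicting 1
        # gives a smaller squared error over the cluster's rows than 0.
        profiles = []
        for k in range(K):
            rows = [S[i] for i in range(n_rows) if assignment[i] == k]
            profiles.append([
                1 if sum((r[j] - 1) ** 2 for r in rows) < sum(r[j] ** 2 for r in rows) else 0
                for j in range(n_cols)
            ])
        # Step a): move each candidate to the best fitting cluster profile.
        updated = [
            min(range(K),
                key=lambda k: sum((S[i][j] - profiles[k][j]) ** 2 for j in range(n_cols)))
            for i in range(n_rows)
        ]
        if updated == assignment:  # no change, so Z cannot improve further
            break
        assignment = updated
    return assignment, profiles


if __name__ == "__main__":
    toy = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
           [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
    print(fit_linked_two_mode(toy, [0, 0, 1, 1, 1, 0]))
```

Because the result depends on the initial assignment, several random starts would normally be tried and the solution with the smallest Z kept.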



3 Example: Scenario evaluation in higher education 

3.1 Stage One: Defining the scope of the analysis 

Currently, at many universities, the concrete future of higher education is unclear, 
as is the way to deal with this uncertainty. Whereas some developments like 
demographics (older and fewer Germans), the ongoing Bologna process (more 
standardization and Europe-wide exchange in higher education), the importance of better 
and life-long education, or the higher competition between universities for funds and 
talented students seem to be predictable, other developments are highly uncertain 
(see, e.g., Michel (2006), Opaschowski (2006), Schulmeister (2006)). 

This uncertainty is especially hard to bear for universities that plan to invest in 
technical teaching and learning environments and/or plan to attract more students for 
distance learning. Therefore, our main research question deals with the future of higher 
education. As a focal time point we use the year 2020. Also, this analysis is used as 
an application example for our new two-mode clustering approach. 

In the first stage of our scenario analysis, (university) internal as well as (university) 
external influencing factors on higher education were identified and possible projections 
for the near future were described. This was based on a Delphi study on the future 
of eLearning, acceptance and preference surveys, and other research projects at our 
institute (e.g. Göcks (2006)) as well as at other research institutes (e.g. Cuhls et 
al. (2002), Opaschowski (2006)). 

Moreover, using expert workshops with teachers, students, people from university 
administration and government, these lists and descriptions were extended and 
modified, resulting in six areas of influence and thirty influencing factors (see Figure 
1) with a total of 73 projections, described in detail, w.r.t. these influencing factors. 




Fig. 1. Influencing factors overview 



3.2 Stage Two: Creating a database of candidates 

In the second stage of scenario analysis, these thirty influencing factors were reduced 
to 12 key factors for the ongoing analysis. We did this by filtering redundant aspects 
and indirect dependencies. Additionally, we used scoring methods and evaluation 
aspects from a group of scientific experts and analyzed relevant scientific sources (see, 
e.g., Kröhnert et al. (2004), Michel (2006)). Furthermore, the alternative projections 
for each key factor were reduced and specified in detail (resulting in one page of text 
for each projection). As a result, a database of 2^11 · 3^1 = 6,144 candidates (all possible 
combinations of the 2-3 projections for each of the 12 key factors) for possible 
futures was available. 

Additionally, the pairwise consistency of these projections was evaluated using 
values ranging from 1 = “totally inconsistent” to 9 = “totally consistent”. Consequently, 
as discussed in the theoretical introduction, a consistency value was calculated for 
each candidate (e.g. (A1,B2,C3,...)) as the mean pairwise consistency of its pairs of 
projections (e.g. (A1,B2), (A1,C3), (B2,C3),...). 

3.3 Stage Three: Evaluating, selecting, and clustering candidates 

In a third stage the database was first reduced and then clustered. For reduction, the 
so-called “complete combination scanning” was used, which means that for each 
pair of projections the candidate with the highest mean pairwise consistency was 
kept for further analysis. The reduction resulted in 286 candidates. 

The binary descriptions of these candidates yielded a binary database S with 
286 rows and 25 columns. This database was - in the follow-up analysis - subjected 
to the two-mode clustering approaches for scenario evaluation from section 2.2 with 
identical numbers K and L and W restricted to an identity matrix (for linking first- 
and second-mode clusters). 
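
The consistency value of a candidate and the reduction step can be sketched in Python as follows. The sketch rests on assumptions: the function names, the representation of ratings as a dictionary keyed by unordered pairs of projections, and the rule that exactly one candidate is kept per pair of projections from different key factors. Under that reading, 11 key factors with two projections and one with three give 25 projections and 300 − 14 = 286 such pairs, which is consistent with the 286 rows and 25 columns reported above.

```python
from itertools import combinations, product


def mean_consistency(candidate, rating):
    """Mean pairwise consistency of a candidate (a tuple of projections)."""
    pairs = list(combinations(candidate, 2))
    return sum(rating[frozenset(pair)] for pair in pairs) / len(pairs)


def complete_combination_scanning(factors, rating):
    """For each pair of projections keep the most consistent candidate.

    factors: list of tuples, one tuple of projections per key factor.
    rating:  dict mapping frozenset({p, q}) to a consistency value (1..9).
    Returns a dict mapping each pair of projections to the kept candidate.
    """
    candidates = list(product(*factors))
    score = {c: mean_consistency(c, rating) for c in candidates}
    projections = [p for factor in factors for p in factor]
    kept = {}
    for pair in combinations(projections, 2):
        holders = [c for c in candidates if pair[0] in c and pair[1] in c]
        if holders:  # projections of the same key factor never co-occur
            kept[pair] = max(holders, key=score.get)
    return kept


if __name__ == "__main__":
    # Hypothetical ratings for two key factors with two projections each.
    toy_factors = [("A1", "A2"), ("B1", "B2")]
    toy_rating = {frozenset(("A1", "B1")): 9, frozenset(("A1", "B2")): 3,
                  frozenset(("A2", "B1")): 2, frozenset(("A2", "B2")): 8}
    print(complete_combination_scanning(toy_factors, toy_rating))
```

Each kept candidate can then be written as a binary row over all projections, which gives the matrix S that enters the two-mode clustering.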

The resulting VAF-values from analyses with totals of K = L = 1 to 8 clusters 
(VAF = 0.056, 0.243, 0.325, 0.362, 0.363, 0.394, 0.448, 0.452) indicate via an elbow 
criterion that a two- or a four-class solution should be preferred. When focusing on 
the two-class solution, the first- and second-mode memberships of the results lead 
to two scenario interpretations, a scenario 1 “A Technology Based Future” and a 
scenario 2 “A Worse Perspective” (note that the follow-up discussion of the two 
scenarios is mainly based on the projections within the two derived two-mode clusters). 
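
The elbow criterion can be made explicit by looking at the gain in VAF for each additional cluster. The short Python sketch below uses the VAF-values reported above; the function name and the rounding to three decimals are illustrative choices.

```python
# VAF-values for K = L = 1, ..., 8 as reported in the text.
vaf = [0.056, 0.243, 0.325, 0.362, 0.363, 0.394, 0.448, 0.452]


def vaf_gains(values):
    """Gain in VAF when moving from K-1 to K clusters, keyed by K."""
    return {k + 2: round(b - a, 3) for k, (a, b) in enumerate(zip(values, values[1:]))}


gains = vaf_gains(vaf)
for k, gain in gains.items():
    print(f"K = {k}: +{gain:.3f}")
```

The gain falls from 0.187 for the second cluster to 0.082 for the third, and it nearly vanishes for the fifth cluster (0.001), which is in line with preferring a two- or a four-class solution.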

3.4 Stage Four: Developing strategic options 

Scenario 1: A Technology based future: This scenario presents a dilly future per- 
spective for higher education. Students have passion for technology in the sense of 
education technologies and learning software. They are motivated to learn like 
conscientious learners. The university lecturers see a greater importance in giving 
lectures than in doing research. 

The traditional lecture forms will be enhanced by eLearning components like 
online teaching and blended learning scenarios. There will be a unity of traditional 
and new lesson forms. The future will contain state universities as well as private 
ones in the education market. 

The learning infrastructure and administration environment (technology, build- 
ings, networks, etc.) will be excellent. Because of hard competition in the education 
market, the universities are very flexible and try to be better than their competi- 
tors. They are able to assimilate new aspects and trends in learning innovations (like 
eLearning) very quickly. The usage of information and communication technologies 
is established very well and in higher education eLearning aspects are used very 
often. 

eLearning aspects help to enforce individualised learning for better results in the 
studies of each student. These facts will be supported by a high level of education 
awareness in the whole society in addition. The importance of job market issues 
forces the students to acquire an additional expertise in languages, soft skills, and 
other competences. 

Scenario 2: A worse perspective: The second extreme scenario presents the 
complete opposite of scenario 1. The future in higher education is not very attractive. 
There are no interested and committed students in the study courses, lecturers have 
little interest in teaching, and there are no changes in traditional ways of teaching and 
no private education suppliers in the market. Universities have resources to offer an 
optimal learning environment and infrastructure (library, internal working places, etc.). 
No flexibility will prevail at the universities and no eLearning technologies will be used. 
The consequence is that no individualized learning will be offered. Education is no longer an 
emphasis from society’s point of view. 

When analyzing the four-class solution, the above results are supported: Again 
the two extreme scenarios could be found, but now two additional in-between sce- 
narios are available. These two scenarios mainly differ from the above two w.r.t. the 
university principle (state, private, or mixed) and the importance of job market issues 
on the teaching contents and environment (high or low influence). 



4 Conclusions 

In this paper, we have introduced new two-mode clustering approaches for scenario 
evaluation. They fit naturally into the traditional four-stage approach to scenario analysis 
by providing an alternative way of analyzing the database of consistent candidates for 
possible futures. In contrast to the traditional one-mode clustering approaches for this purpose, the 
two-mode approach quite naturally develops clusters of candidates and describing 
projections. No follow-up decisions concerning fuzzy memberships of candidates or 
memberships of projections have to be made. 

Abstract. The process of assigning keywords to a special group of objects is often called 
tagging and has become an important feature of community based networks like Flickr, YouTube 
or Last.fm. This kind of user generated content can be used to define a similarity measure for 
those objects. We report on the usage of Emergent-Self-Organizing-Maps (ESOM) and U-Map 
techniques to visualize and cluster this sort of tagged data in order to discover emergent 
structures in collections of music. An item is described by the feature vector of the most 
frequently used tags. A meaningful similarity measure for the resulting vectors needs to be 
defined by removing redundancies and adjusting the variances. In this work we present the 
principles and first examples of the resulting U-Maps. 



1 Introduction 

The increased interest in folksonomies like Flickr, Last.fm, YouTube, del.icio.us and
other community-based networks shows that tagging is already used by many users
to discover new material and has become a collaborative way of classifying items,
controlled by the creators and consumers of the content. One popular way to visualize
tag relations is the tag cloud, which displays the most used tags on a website. More
frequently used tags have a larger font, and the tags are normally ordered
alphabetically. For our study we chose to analyse the data provided by the music
community Last.fm, an internet radio featuring a music recommendation system.
The users can assign tags to artists and browse the content via tags, allowing them to
listen only to songs tagged in a certain way.

Tags make it possible to organize the media (artists and songs) in a semantic way
and provide a useful basis for discovering new music. Because of the huge number of
artists and songs, an intuitive user interface is required to avoid losing the overview.
We propose the Emergent Self-Organizing Map (ESOM) (Ultsch (2003)) to cluster
tagged data because it has some advantages over other clustering algorithms. It is
topology preserving and, combined with the U-Map, it provides a visually appealing
user interface and an intuitive way of exploring new content. The remainder of this
paper is organized as follows. First, some related work on tagged data and on clustering
music and documents with the ESOM is presented. Then we describe the main learning
algorithm of the ESOM in section 3 together with the U-Map visualization. Next, the
dataset is presented together with the methods used for data preparation. In section 5
we present our experimental results. We round off the paper by giving the conclusion
in section 6 together with future research directions.



2 Related work 

There has been some work on enhancing the user interface based on tags and we will 
briefly mention some here. Flickr uses Flickr clusters which can provide related tags 
to a popular tag, grouped into clusters. Begelman (2006) uses clustering algorithms to 
find strongly related tags visualizing them as a graph. Hassan-Montero et al. (2006) 
propose a method for an improved tag cloud and a technique to display these tags 
with clustering based layout. 

The ESOM has already been used successfully to visualize collections of music and
photos and to cluster documents. Most of these works have in common that they
cluster the data based on features extracted directly from the media. An example is
MusicMiner (Mörchen (2005)), which uses the timbre distance, a measure based on
frequency analysis of audio data. The WEBSOM project (Kaski (1998)) is an ESOM
based approach to free text mining. Here each document is encoded as a histogram of
word categories which are formed by the ESOM algorithm based on the similarities
in the contexts of the words.

Although our approach is different because we are not using information that can 
be extracted from the objects’ raw data itself but instead user generated content, the 
works mentioned previously show that the ESOM is a powerful tool in visualizing 
high dimensional data. 



3 Emergent Self Organizing Maps 

The ESOM is an artificial neural network that performs a mapping from a high
dimensional data space onto a two-dimensional grid of neurons. The unsupervised
training process is partly motivated by how visual information is handled in the
cerebral cortex of the mammalian brain and equals a regression of an ordered set of
model vectors $m_i \in \mathbb{R}^n$ into the space of observation vectors
$x \in \mathbb{R}^n$ by performing the following process:

$$m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,\bigl(x(t) - m_i(t)\bigr)$$

where t is the sample index of the regression step, whereby the regression is
performed recursively for each presentation of a sample of x. Index c, the
best-matching unit (BMU) or winner, is defined by the condition

$$\|x(t) - m_c(t)\| \le \|x(t) - m_i(t)\| \quad \forall i$$

The so-called neighbourhood function h is often taken to be the Gaussian

$$h_{c(x),i}(t) = \alpha(t)\,\exp\left(-\frac{\|r_i - r_c\|^2}{2\sigma^2(t)}\right)$$

where $0 < \alpha(t) < 1$ is the learning-rate factor, which decreases monotonically with
the regression steps, $r_i$ and $r_c$ are the vectorial locations in the display grid and
$\sigma(t)$ corresponds to the width of the neighbourhood function, which also decreases
monotonically with the regression steps. For a more detailed discussion of the SOM
see Kaski (1997).
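
The regression step described above can be illustrated with a minimal sketch. It is not the implementation used for the experiments; the function names, the tiny 1x3 grid and all numerical values are assumptions chosen for illustration, and the grid here is flat rather than toroid.

```python
import math

def best_matching_unit(weights, x):
    """Index c of the model vector closest to x (Euclidean distance)."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))

def gaussian_neighbourhood(r_i, r_c, alpha, sigma):
    """h = alpha * exp(-||r_i - r_c||^2 / (2 * sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(r_i, r_c))
    return alpha * math.exp(-d2 / (2.0 * sigma ** 2))

def som_step(weights, positions, x, alpha, sigma):
    """One regression step: m_i <- m_i + h_{c,i} * (x - m_i) for every neuron i."""
    c = best_matching_unit(weights, x)
    updated = []
    for m_i, r_i in zip(weights, positions):
        h = gaussian_neighbourhood(r_i, positions[c], alpha, sigma)
        updated.append([m + h * (xk - m) for m, xk in zip(m_i, x)])
    return c, updated

# Hypothetical toy setup: three neurons in a row, two-dimensional model vectors.
positions = [(0, 0), (0, 1), (0, 2)]
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
c, new_weights = som_step(weights, positions, x=[1.0, 0.8], alpha=0.5, sigma=1.0)
print(c, new_weights)
```

In a full training run, alpha and sigma would be decreased from step to step, as stated in the text.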



U-Map visualization 

The U-Map (Ultsch (2003)) is constructed on top of the map of the ESOM. The U-Height
for each neuron $n_i$ equals the accumulated distances of $n_i$ to its immediate neighbours
$N(i)$. It is calculated as follows:

$$\text{U-Height}(n_i) = \sum_{j \in N(i)} d(m_i, m_j)$$

where $d(x, y)$ is the distance function used in the SOM algorithm to construct the
map and $N(i)$ denotes the indices of the immediate neighbours of neuron i.
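
As a minimal sketch of the U-Height computation, the following code sums the distances from each neuron to its grid neighbours. Two points are assumptions not fixed by the text: the immediate neighbours are taken to be the four grid neighbours, and the Euclidean distance is used for d. The grid wraps around at the borders (toroid).

```python
import math

def u_heights(weights, rows, cols):
    """Sum of distances from each neuron's model vector to its grid neighbours.

    weights[r][c] is the model vector of the neuron at grid position (r, c).
    Neighbours are the 4-neighbourhood on a toroid grid (assumption).
    """
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # A set avoids counting a neighbour twice on very small grids.
            neighbours = {((r - 1) % rows, c), ((r + 1) % rows, c),
                          (r, (c - 1) % cols), (r, (c + 1) % cols)}
            neighbours.discard((r, c))
            heights[r][c] = sum(math.dist(weights[r][c], weights[nr][nc])
                                for nr, nc in neighbours)
    return heights

# Hypothetical 3x3 map: all model vectors equal except the one in the centre.
grid = [[[0.0] for _ in range(3)] for _ in range(3)]
grid[1][1] = [1.0]
heights = u_heights(grid, 3, 3)
print(heights)
```

The isolated centre neuron receives the largest U-Height, which corresponds to a mountain in the landscape view.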

A single U-Height shows the local distance structure of the corresponding neuron.
The overall structure of densities emerges when the U-Map is viewed as a whole.
A U-Map is usually displayed as a three-dimensional landscape and has
become a standard tool to display the distance structures of the ESOM. The U-Map
therefore delivers a 'landscape' of the distance relationships of the input data in the
data space. It has the property that weight vectors of neurons with large U-Heights are
very distant from other vectors in the data space and that weight vectors of neurons
with small U-Heights are surrounded by other vectors in the data space. Outliers and
other possible cluster structures can easily be recognized. U-Maps have been used in
a number of applications to detect new and meaningful knowledge in data sets.



4 Data 

We extracted 1200 artists from the Last.fm website together with the 250 most fre- 
quently used tags like rock, pop, metal, etc. 

4.1 Preparation of the datasets

Before the ESOM can be trained, special demands have to be fulfilled. Tags from the
Last.fm dataset which do not stand for a certain kind of music genre, like seen-live,
favourite albums, etc. were excluded. Highly correlated tags were condensed to a
single feature. For the preparation of the tagged data we used a modification of the
Inverse Document Frequency (IDF).

Fig. 1. The variances of the seven most popular tags



Last.fm provides the number of people ($t_{ij} = \mathrm{tagcount}_{ij}$) that have used a
specific tag i for an artist j. We scaled $t_{ij}$ to the range [0, 1]. Then we slightly
modified the term frequency to be more appropriate for tagged data:

$$\mathrm{tf}_{ij} = \frac{t_{ij}}{\sum_k t_{kj}}$$

with the denominator being the accumulated frequencies over all tags used for
artist j. The IDF of tag i is defined as

$$\mathrm{idf}_i = \log\frac{|D|}{\sum_k t_{ik}}$$

with $|D|$ being the total number of artists in the collection and $\sum_k t_{ik}$ being the
accumulated frequencies of tag i over all artists. The resulting importance of tag i for
artist j is given by

$$\mathrm{tfidf}_{ij} = \mathrm{tf}_{ij}\,\mathrm{idf}_i$$
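
The weighting can be sketched in a few lines, assuming the tag counts are stored as a nested list `t[i][j]` (tag i, artist j) that is already scaled to [0, 1]. The function name and the toy counts are hypothetical and are not taken from the Last.fm data.

```python
import math

def tag_tfidf(t):
    """Return tfidf[i][j] = tf[i][j] * idf[i] for scaled tag counts t[i][j]."""
    n_tags, n_artists = len(t), len(t[0])
    # Accumulated frequencies over all tags of artist j (denominator of tf).
    per_artist = [sum(t[i][j] for i in range(n_tags)) for j in range(n_artists)]
    # Accumulated frequencies of tag i over all artists (denominator of idf).
    per_tag = [sum(row) for row in t]
    result = []
    for i in range(n_tags):
        idf = math.log(n_artists / per_tag[i]) if per_tag[i] > 0 else 0.0
        result.append([
            (t[i][j] / per_artist[j]) * idf if per_artist[j] > 0 else 0.0
            for j in range(n_artists)
        ])
    return result

counts = [
    [1.0, 0.8, 0.9],   # a tag used heavily for every artist
    [0.0, 0.2, 0.0],   # a tag used for one artist only
]
scores = tag_tfidf(counts)
print(scores)
```

For the second artist the rarely used tag receives a higher weight than the ubiquitous one, which is the intended effect of the IDF factor.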

As can be seen in figure 1, the tags of the Last.fm dataset differ a lot in variance,
but for a meaningful comparison of the variables these variances have to be adjusted.
For this purpose we used the empirical cumulative distribution function (ECDF). The
idea behind the ECDF is to assign a probability of 1/n to each of the n observations in
the sample. The final tag frequencies are then given by

$$\widehat{\mathrm{tfidf}}_{ij} = \frac{1}{n}\,\bigl|\{\,k : \mathrm{tfidf}_{ik} \le \mathrm{tfidf}_{ij}\,\}\bigr|, \quad k = 1, \ldots, n$$



The adjusted variances after applying ECDF can be seen in figure 2. The accu- 
mulated tag frequencies of the Last.fm dataset can be seen in figure 3. 
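
A minimal sketch of the ECDF adjustment for a single tag: every value is replaced by the fraction of observations that do not exceed it. The use of "less than or equal" and the sample values are assumptions made for illustration.

```python
def ecdf_transform(values):
    """Replace each value v by the fraction of observations that are <= v."""
    n = len(values)
    return [sum(1 for u in values if u <= v) / n for v in values]

# Hypothetical tf-idf values of one tag over five artists, with one extreme value.
raw = [0.02, 0.50, 0.02, 3.10, 0.90]
adjusted = ecdf_transform(raw)
print(adjusted)
```

After the transformation all values lie in (0, 1], so a single extreme observation no longer dominates the variance of the variable.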

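Finally we obtain the feature vector $w_j$ for artist j as

$$w_j = \bigl(\widehat{\mathrm{tfidf}}_{1j}, \ldots, \widehat{\mathrm{tfidf}}_{mj}\bigr)$$

where m denotes the number of tags.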

Fig. 2. Tag variances: (a) the ECDF adjusted variances of the Last.fm tags; (b) accumulated tag frequencies

In the context of self organizing maps, two different measures have been proposed
to compute the similarity between two feature vectors $w_i$ and $w_j$. The first
method uses the familiar Euclidean distance, while the second approach is based on
the cosine similarity

$$\cos(w_i, w_j) = \frac{w_i^{\top} w_j}{\|w_i\|\,\|w_j\|}$$

This method emphasizes the relative values that each dimension has within each
vector and not their overall length. Two vectors can have a cosine distance of zero
(a cosine similarity of one) even if their Euclidean distance is arbitrarily large. A SOM
model which uses the cosine similarity instead of the Euclidean distance has also been
proposed by Kohonen (1982), introduced as Dot-Product-SOM, and has been successfully
used for document clustering problems. Because of the close analogy to tag spaces we
decided to use this model rather than the standard model based on the Euclidean distance.
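
The difference between the two measures can be checked with a small example. The two vectors below are hypothetical; they point in the same direction but differ strongly in length.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1.0, 2.0, 0.0]
v = [100.0, 200.0, 0.0]          # same tag profile, much larger counts
similarity = cosine_similarity(u, v)
distance = math.dist(u, v)
print(similarity, distance)
```

The cosine similarity is one up to rounding, while the Euclidean distance between the two vectors is large.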

Note that the update function changes to

$$m_i(t+1) = \frac{m_i(t) + h_{c(x),i}(t)\,x(t)}{\|m_i(t) + h_{c(x),i}(t)\,x(t)\|}$$



Although the training process slows down due to the normalization at each step, 
the search for the bestmatch is very fast and simple. 
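
A minimal sketch of one Dot-Product-SOM step, under the assumption that both the input and the model vectors are normalized to unit length, so that the best match is simply the neuron with the largest dot product. The function names and the two-neuron setup are hypothetical.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot_product_som_step(weights, positions, x, alpha, sigma):
    """m_i <- (m_i + h_{c,i} * x) / ||m_i + h_{c,i} * x|| for every neuron i."""
    dots = [sum(m * xk for m, xk in zip(m_i, x)) for m_i in weights]
    c = max(range(len(weights)), key=dots.__getitem__)   # best match: largest dot product
    updated = []
    for m_i, r_i in zip(weights, positions):
        d2 = sum((a - b) ** 2 for a, b in zip(r_i, positions[c]))
        h = alpha * math.exp(-d2 / (2.0 * sigma ** 2))
        updated.append(normalize([m + h * xk for m, xk in zip(m_i, x)]))
    return c, updated

positions = [(0, 0), (0, 1)]
weights = [[1.0, 0.0], [0.0, 1.0]]            # unit-length model vectors
x = [0.6, 0.8]                                # unit-length input
c, new_weights = dot_product_som_step(weights, positions, x, alpha=0.5, sigma=1.0)
print(c, new_weights)
```

The normalization in the last line of the loop is the extra cost per step mentioned above; the search for the best match needs only dot products.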



5 Experimental results 

We trained an 80x50 emergent self organizing map for 50 epochs with the
preprocessed data using the Databionics ESOM tool (Ultsch and Mörchen, 2005). A toroid
topology was used to avoid border effects.

The U-Map in figure 4 (visualized using Spin3D) can be interpreted as height
values on top of the usually two-dimensional grid of the ESOM, leading to an
intuitive paradigm of a landscape. Clearly defined borders between clusters, where
large distances in data space are present, are visualized in the form of high mountains.
Smaller intra-cluster distances or borders of overlapping clusters form smaller hills.
Homogeneous regions of data space are placed in valleys.

Fig. 4. Zoom of cluster rock to illustrate the good innercluster quality.

Detailed inspection of the map shows a very good conservation of the intercluster 
relations between the different music genres. One can observe smooth transitions 
between clusters like metal, rock, indie and pop. 

In figure 5 we show a detailed view of the cluster rock. 

The innercluster relations, e.g. the relations between genres like hard rock, clas- 
sic rock, rock and roll and modern rock are very well preserved. This property also 
holds for the other clusters. 

An interesting area is the little cluster metal next to the cluster classic. A precise
examination revealed the reason for this cluster not being part of the big cluster metal.
The cluster classic contains the older classical artists like Ludwig van Beethoven on
the lower right edge with a transition to newer artists of the classical genre when
moving to the upper left. The neighbouring artists of the minicluster metal are bands
like Apocalyptica and Therion which use a lot of classical elements in their songs.



6 Conclusion and future work 

Our goal was to find a visualization method that fits the needs and constraints of
browsing collections of tagged data. A high-dimensional feature vector of 250
dimensions is hard to grasp, and clustering can reveal groups of similar objects based
on their tags. The global organization of the tagged artists worked well and, in
contrast to other clustering algorithms, soft transitions between the groups of similarly
tagged artists can be seen. The modified Inverse Document Frequency turned out to be
a good preparation method when working with tagged data. It is however essential
for the ESOM that the feature vectors are not too sparse and that the overlap between
them is not too low. These problems occurred in experiments with the photo community
Flickr, where information about tags is only binary (a tag occurs or not) without
information about the tag frequencies.

We showed that the ESOM enables the user to navigate through the high-dimensional
space in an intuitive way. Future work could include combining the clustering
of artists and their songs and an automatic playlist generation system from regions
and paths on the map. The maps presented here can be seen in color and high
resolution at www.indiji.com/musicsom.

Abstract. In archaeometry the focus is mainly on chemical analysis of archaeological
artefacts such as glass objects or pottery. Usually the artefacts are characterized by their
chemical composition. Here the focus is on cluster analysis of compositional data. Using
Euclidean distances, cluster analysis is closely related to principal component analysis
(PCA), which is a frequently used multivariate projection technique in archaeometry. Since
PCA and cluster analysis based on Euclidean distances are scale dependent, some kind of
"appropriate" data transformation is necessary. Several techniques of data preparation
will be presented. We consider the log-ratio transformation of Aitchison and the
transformation into ranks in more detail. From the statistical point of view the latter is
a robust method.



1 Introduction 

Often the archaeometric data we analyze are measured with respect to the chemical 
compositions of many variables that usually have quite different scales. For example, 
Mucha et al. (2001) investigated a data set of ancient coarse ceramics by cluster 
analysis, where the set of 19 variables consists of nine oxides and ten trace elements 
(see below Section 6). The former are given in percent and the latter are measured in 
parts per million (ppm). Hence some kind of treatment of the data is necessary since 
PCA and cluster analysis based on Euclidean distances are scale dependent. Without 
some standardization, the Euclidean distances can be fully dominated by the variable 
in the more sensitive units. However, as we will see below, an inappropriate data 
transformation can result in covering the differences between well-separated groups 
(clusters). Moreover it can produce outliers. 

Besides different scales of the variables, problems with outliers and with
long-tailed (skew) distributions of the variables have often been addressed in
archaeometric data, see recently Baxter (2006). Figure 1 shows an example taken from
Baxter and Freestone (2006) (see also below Section 5). This is discrete data rather than
metric data: the measurements are given as 0.01, 0.02 and so on. The usual way of dealing
with outliers seems to be omitting them, see for instance Baxter (2006) and Baxter
and Freestone (2006). Another more objective way is using transformation into
ranks, as will be shown below.



Fig. 1. The frequency plot (histogram) of MnO of 80 objects (data: Baxter & Freestone,
Archaeometry 48, 2006) shows a skew density. Additionally, at the bottom the
corresponding rank values are shown.



Indeed, the performance of multivariate statistical methods like cluster analysis
and PCA is often seriously affected by these two main problems: scale dependence
and outliers. Concerning PCA see Baxter (1995) and Baxter and Freestone (2006).
Therefore data transformations and outlier treatment are highly recommended by
these authors.

Here different data transformations will be presented and compared. Our inves- 
tigation shows that especially nonparametric transformations like the transformation 
of the data into a matrix of ranks for subsequent multivariate statistical analysis give 
good and for archaeologists reasonable results. We consider two data sets: the com- 
positional data of colourless Romano-British vessel glass where the variables mea- 
sured sum to 100%, and the sub-compositional data of Roman bricks and tiles from 
the Rhine area where the variables measured sum to approximately 100%. 



2 Data transformation in archaeometry 

Let I objects $x_i$ be on hand for J variables. That is, a data matrix $X = (x_{ij})$ with
elements $x_{ij} \ge 0$ is under investigation. For compositional data, Aitchison (1986)
recommended the log-ratio transformation

$$y_{ij} = \log\bigl(x_{ij} / g(x_i)\bigr), \qquad (1)$$






where $g(x_i) = (x_{i1} x_{i2} \cdots x_{iJ})^{1/J}$ is the geometric mean of the ith object. This
transformation is restricted to values > 0. Baxter and Freestone (2006) criticized that
Aitchison argued that all other transformations are "meaningless" and "inappropriate"
for compositional data. The authors presented the failure of PCA for different
data sets based on the log-ratio transformation. In Section 5 below the failure of
cluster analysis methods based on the log-ratio transformation will be presented.

The transformation of the variables by

$$y_{ij} = (x_{ij} - \bar{x}_j) / s_j \qquad (2)$$

is known as standardization. Herein $\bar{x}_j$ and $s_j$ are the mean and standard deviation of
variable j, respectively. The new variable $y_j$ has mean 0 and variance 1. The
logarithmic transformations

$$y_{ij} = \log(x_{ij}) \qquad (3)$$

or

$$y_{ij} = \log(x_{ij} + 1) \qquad (4)$$

can handle skew densities, where (3) is restricted to values $x_{ij} > 0$, as is the log-ratio
transformation (1). Here the meaning of differences is changed.
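
The transformations (1), (2) and (4) can be sketched as follows. The function names and the example composition are hypothetical, and the sample standard deviation (divisor I - 1) is an assumption, since the text does not state which divisor is used.

```python
import math
from statistics import mean, stdev

def log_ratio(row):
    """Transformation (1): log of each part divided by the object's geometric mean."""
    log_g = mean(math.log(x) for x in row)       # log of the geometric mean
    return [math.log(x) - log_g for x in row]

def standardize(column):
    """Transformation (2): zero mean and unit variance for one variable."""
    m, s = mean(column), stdev(column)           # sample standard deviation (assumed)
    return [(x - m) / s for x in column]

def log_shifted(column):
    """Transformation (4): log(x + 1), which is also defined for x = 0."""
    return [math.log(x + 1.0) for x in column]

composition = [70.0, 20.0, 9.99, 0.01]           # hypothetical object, parts sum to 100
ratios = log_ratio(composition)
z = standardize([1.0, 2.0, 3.0, 4.0])
print(ratios, z, log_shifted([0.0, 0.01]))
```

The log-ratios of one object sum to zero, and the part close to zero is mapped to a strongly negative value, which hints at how (1) can create outliers.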



3 Transformation into ranks 

The multivariate statistical analysis based on ranks rather than based on the original 
data solves the problems of different scales and skewness. The influence of outliers 
is removed in the univariate case. In the multivariate case, the influence of outliers 
is highly reduced usually but theoretically the problem of outliers remains to some 
degree (Rohatch et al. (2006)). 

Table 1. Measurements and the corresponding ranks of MnO

Value      0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09  0.10  0.11  0.13
Frequency    17    18    20     7     1     5     4     2     3     1     1     1
Rank          9  26.5  45.5    59    63    66  70.5  73.5    76    78    79    80



Transformation into ranks is quite simple: one replaces the measurements by
their ranks 1, 2, ..., I, where I is the number of observations. The mean of each of
the new rank order variables becomes the same: $(I + 1)/2$. Moreover, the variance of
each of the new variables becomes the same: $(I^2 - 1)/12$. In case of multiple values
we recommend averaging the corresponding ranks (Figure 1). Table 1 contains both
the original values and the ranks of MnO of the 80 objects (see also Figure 1, data
source: Baxter and Freestone (2006)).
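
The averaging of ranks for tied values can be sketched as follows. The function name is hypothetical; the frequencies are those of Table 1, and the measurements are coded as integers in hundredths (only their order matters for the ranks).

```python
def average_ranks(values):
    """Ranks 1..I, where tied values share the mean of the ranks they occupy."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1           # mean of the 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

# MnO values in hundredths and their frequencies as in Table 1.
levels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13]
frequencies = [17, 18, 20, 7, 1, 5, 4, 2, 3, 1, 1, 1]
data = [v for v, f in zip(levels, frequencies) for _ in range(f)]
ranks = average_ranks(data)
rank_of_level = {v: r for v, r in zip(data, ranks)}
print([rank_of_level[v] for v in levels])
```

The printed ranks agree with the last row of Table 1, and their mean is (I + 1)/2 = 40.5 for I = 80.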

Mucha (1992) presented a successful application of partitioning cluster analysis
based on rank data. Also, Mucha (2007) investigated the stability of hierarchical
clustering based on rank data. The aim of this paper is to show that cluster
analysis based on rank data gives good results and that it can outperform log-ratio
cluster analysis.

Fig. 2. PCA plot of groups of Romano-British vessel glass based on ranks (left hand side), and
PCA plot of group membership based on log-ratio transformed data (right).

Fig. 3. Fingerprint of the true Euclidean distances of rank data (left) and of log-ratio
transformed data (right). (Small distances are marked by dark gray, great distances by light gray.)



4 Distances and cluster analysis 

Henceforth let us focus on the squared Euclidean distances in cluster analysis
because PCA is based on the same distance measure and the PCA plots are very
popular in archaeometry (Baxter (1995), Baxter and Freestone (2006)). Cluster analysis
and PCA are multivariate statistical methods that are based on distance measures.
Further let us restrict ourselves to the well-known hierarchical Ward's method (Späth (1985)).
It is the simplest of the model-based Gaussian clustering methods that are applied by
Papageorgiou et al. (2002) for finding groups of artefacts.

In case of the log-ratio transformation (1) the squared Euclidean distance
between two objects i and h is

$$d_{ih}^2 = \sum_{j=1}^{J} (y_{ij} - y_{hj})^2 = \sum_{j=1}^{J} \left(\log\frac{x_{ij}}{g(x_i)} - \log\frac{x_{hj}}{g(x_h)}\right)^2. \qquad (5)$$

Often it is called Aitchison distance. Appropriate clustering techniques for squared
Euclidean distances are the partitioning K-means method (Mucha (1992)) and
the hierarchical Ward's method, as mentioned already above.
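
A minimal sketch of the squared Aitchison distance, written directly from the log-ratio transformation. The function names and the three example compositions are hypothetical.

```python
import math

def centred_log_ratio(row):
    """Log of each part minus the mean log, i.e. log(x_j / g(x))."""
    logs = [math.log(x) for x in row]
    centre = sum(logs) / len(logs)
    return [v - centre for v in logs]

def squared_aitchison_distance(a, b):
    """Squared Euclidean distance between the log-ratio transformed objects."""
    return sum((p - q) ** 2
               for p, q in zip(centred_log_ratio(a), centred_log_ratio(b)))

a = [50.0, 30.0, 20.0]
b = [50.0, 30.5, 19.5]        # small change in two parts
c = [50.0, 49.99, 0.01]       # one part close to zero
d_ab = squared_aitchison_distance(a, b)
d_ac = squared_aitchison_distance(a, c)
d_scaled = squared_aitchison_distance(a, [10.0 * x for x in a])
print(d_ab, d_ac, d_scaled)
```

The distance ignores a common scaling of all parts, but a single measurement near zero produces a very large distance, which is consistent with the sensitivity to values near 0 discussed in this paper.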



5 Romano-British vessel glass classified 

This is simulated data based on real data of colourless Romano-British vessel glass
(Baxter et al. (2005)). Details and the complete source can be taken from Baxter and
Freestone (2006). This example is based on two groups that are known to be different.

Group 1 consists of 40 cast bowls with high amounts of Fe2O3. Group 2 also
consists of 40 objects: this is a collection of facet-cut beakers with low Al2O3. In
Figure 2 at the left hand side, the two groups are shown in the first plane of the PCA
based on rank data. This projection gives a good approximation of the distances
between objects. Axis 1 (39%) and axis 2 (20%) are highly significant (see Lebart et
al. (1984) for tables of significance of eigenvalues of PCA). Ward's method finds
the true groups without any error. The same optimum clustering result is obtained
when using the transformation (4).

In Figure 2 at the right hand side, the two groups are presented by the PCA plot
after the data transformation by (1). This transformation produces outliers such as
the object 79 that is drawn additionally. The PCA is based on the Aitchison distance
measure (5). In the two-dimensional projection the distances are approximate ones.
Ward's method never finds the true two groups. Table 2 at the left hand side
shows the very low correspondence between the given groups and the clusters found.
The same bad cluster analysis result is obtained when using the transformation (3).
The transformation (2) performs much better here: Ward's method results in 5
errors only (see Table 2 at the right hand side). The corresponding PCA plot of the
standardized data using (2) is published as Figure 8 by Baxter and Freestone (2006).
There is no outlier in this plot, nor in the plot of Figure 2 at the left hand side.



Table 2. True groups versus clusters

                     Ward's method with (1)     Ward's method with (2)
True groups          Cluster 1    Cluster 2     Cluster 1    Cluster 2
Cast bowls                  27            9            37            3
Facet-cut beakers           33            8             2           38



Figure 3 compares two fingerprints of the Euclidean distances of rank data (left
hand side) and of log-ratio transformed data (right), respectively. Here the objects are
sorted first by group and then within the group by the first principal component based
on rank analysis and by the first principal component based on log-ratio scaling,
respectively. The fingerprint at the right hand side shows no clear class structure.
Additionally, the outlier 79 is marked at the bottom. The corresponding high distance
values to all the remaining objects build the eye-catching column and row in light
gray, respectively.

Fig. 4. PCA plot of group membership based on rank data.



6 Roman bricks and tiles classified 

Roman bricks and tiles from the Rhine area are described by 19 chemical elements
that were measured using X-Ray Fluorescence Analysis (XRF). All the chemical
measurements were performed by G. Schneider of the Freie Universität Berlin. Two
well-known locations of production are Groß-Krotzenburg and Straßburg-Königshofen
(Dolata (2000)). In this reference the author published the complete data source. Is
it possible to confirm the two well-known groups by cluster analysis based on rank
data?

Figure 4 shows the PCA plot of the two groups based on rank data. The
hierarchical Ward's method finds the true groups without any error.

In Figure 5 the two groups are shown by the PCA projection based on the data
transformation (1). Here Ward's method finds the groups but one error occurs: the
outlier at the bottom at the left hand side coming from Straßburg-Königshofen is
misclassified.



7 Summary

There are different data transformations in use in archaeometry with advantages and
disadvantages. Comparison of different data transformations based on simulated and
real data shows that transformation into ranks is useful in the case of outliers and
skew densities. However, most of the quantitative information is lost by going to
ranks. From the archaeological point of view rank analysis gives reasonable results.
Other transformations (like Aitchison's log-ratio or (3)) are highly affected by
outliers, skew densities and values near 0. Therefore finding the true groups by cluster
analysis fails in the case of the glass data. Moreover, new artificial outliers can be
produced by transformations such as (1) and (3) in case of measurements near zero.

Abstract. In sensory analysis a panel of assessors gives scores to blocks of sensory attributes
for profiling products, thus yielding a three-way table crossing assessors, attributes and
products. In this context, it is important to evaluate the panel performance as well as to
synthesize the scores into a global assessment to investigate differences between products.
Recently, a combined approach of fuzzy regression and PLS path modeling has been proposed.
Fuzzy regression considers crisp/fuzzy variables and identifies a set of fuzzy parameters using
optimization techniques. In this framework, the present work aims to show the advantages of
fuzzy PLS path modeling in the context of sensory analysis.



1 Introduction 

In sensory analysis a panel of assessors gives scores to blocks of sensory attributes
for profiling products, thus yielding a three-way table crossing assessors, attributes
and products. This type of data is characterized by three different sources of
complexity: a complex structure of relations among the variables (different blocks), three
directions of information (samples, assessors, attributes) and the influential involvement
of human beings (assessors' evaluations).

Structural Equation Models (SEM) (Bollen, 1989) consist of a network of causal
relationships among Latent Variables (LV) defined by blocks of Manifest Variables
(MV). The main idea behind SEM is that the features on which the analysis would
focus cannot be properly measured and are determined through the measured variables.
In a recent contribution (Tenenhaus and Esposito-Vinzi, 2005), SEM have been
successfully used to analyze sensory data. When SEM are based on the scores of a set
of assessors, they are generally based on the mean scores. However, it is important
to analyze whether individual differences exist between assessors. Even if assessors
are carefully trained to adopt the same yardstick, this cannot completely protect us
against their individual sensibility.

When human estimation is influential and the observations cannot be described
accurately, so that we can give only an approximate description of them, the fuzzy
approach is more useful and convenient than the classical one (Zadeh, 1965). Fuzzy sets
allow us to code and treat many different kinds of imprecise data. Recently, a fuzzy
approach to SEM has been proposed (Romano, 2006) and subsequently used for
comparing different SEM (Romano and Palumbo, 2006b).

The present paper proposes to use the new fuzzy structural equation models for
handling the different sources of information and uncertainty arising from sensory
data. First a brief introduction to the methodology of reference (Romano, 2006) will
be given. Then an application to data from sensory profiling will be presented.



2 Fuzzy PLS path modeling 

Fuzzy PLS Path Modeling is a new methodology for dealing with system complexity.
It allows us to take into account both complexity in information codification and in
structures of relations among the variables. Fuzzy codification and structural
equations are combined to handle these different sources of complexity, respectively.

The strategy allowing imprecision in codification for reducing complexity is ap- 
propriately expressed by Zadeh’s principle of incompatibility (Zadeh, 1973). The 
main idea is that the traditional techniques for analyzing systems are not well suited 
to dealing with human systems. In human thinking, the key elements are not numbers 
but classes of objects or concepts in which the membership of each element to the 
class is gradual (fuzzy) rather than sharp. For instance, the concept of sweet coffee 
does not correspond to an exact amount of sugar in the coffee. But it is possible to 
define the classes sweet coffee, normal coffee, bitter coffee. 

On the other hand, the descriptive complexity of a system can also be reduced by 
breaking the system into its appropriate subsystems. This is the general principle 
behind Structural Equation Models (SEM) (Bollen, 1989). The basic idea is that dif- 
ferent subsets of variables are the expression of different concepts, belonging to the 
same phenomenon. These concepts are named latent variables (LV) as they are not 
directly observable but measurable by means of a set of manifest variables (MV). 
The aim of SEM is to study the system of relations between each LV and its MV, and 
among the different LV inside the system. Considering one by one each part forming 
the whole system, and analyzing the relations among the different parts, the system 
complexity is reduced allowing a better description of the main system characteris- 
tics. 

F-PLSPM consists in introducing fuzzy models inside SEM, by means of a two- 
stage procedure. This allows dealing with system complexity using both an approach 
which is tolerant to imprecision and a well suited methodology to link the different 
parts into which the system may be decomposed. 

2.1 Interval data, fuzzy data and fuzzy models 

It is very common to measure statistical variables in terms of single values.
However, for many reasons, and in many situations, exact measures are very hard (or even
impossible) to achieve.

A rigorous study of interval data is given by Interval Analysis (Alefeld and
Herzenberger, 1987). In this framework, an interval value is a bounded subset of real
numbers $[x] = [\underline{x}, \overline{x}]$, formally:

$$[x] = \{x \in \mathbb{R} \mid \underline{x} \le x \le \overline{x}\} \qquad (1)$$

where $\underline{x}$ and $\overline{x}$ are called lower and upper bound, respectively. Alternatively, an
interval value may be expressed in terms of width (or radius), $x_w$, and center (or
midpoint), $x_c$: $x_w = \frac{1}{2}|\overline{x} - \underline{x}|$ and $x_c = \frac{1}{2}|\overline{x} + \underline{x}|$.
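
The two equivalent descriptions of an interval value can be sketched with a small data class. The class and attribute names are hypothetical, and the center is computed as the usual midpoint without taking an absolute value, which is an assumption of this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lower: float
    upper: float

    @property
    def centre(self):
        """Midpoint of the interval."""
        return (self.lower + self.upper) / 2

    @property
    def radius(self):
        """Half of the interval length."""
        return (self.upper - self.lower) / 2

    def __contains__(self, x):
        return self.lower <= x <= self.upper

    @classmethod
    def from_centre_radius(cls, centre, radius):
        """Build the interval from its (centre, radius) description."""
        return cls(centre - radius, centre + radius)

score = Interval(2.0, 5.0)                      # hypothetical imprecise score
print(score.centre, score.radius, 3.0 in score)
```

Both descriptions carry the same information, so an interval can be rebuilt from its center and radius.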



A fuzzy set is a codification of the information allowing us to represent vague
concepts expressed in natural language. Formally, given the universe of objects $\Omega$,
with $\omega$ as the generic element, a fuzzy set A in $\Omega$ is defined as a set of ordered pairs:

$$A = \{(\omega, \mu_A(\omega)) \mid \omega \in \Omega\} \qquad (2)$$

where the value $\mu_A(\omega_0)$ expresses the membership degree for a generic element
$\omega_0 \in \Omega$. The larger the value of $\mu_A(\omega)$, the higher the degree of membership of
$\omega$ in A. If the membership function is permitted to have only the values 0 and 1 then
the fuzzy set is reduced to a classical crisp set. The universal set $\Omega$ may consist of
discrete (ordered and non ordered) objects or it can be a continuous space.

A fuzzy set in the real line that satisfies both the conditions of normality and
convexity is a fuzzy number.

It must be normal so that the statement "real number close to $r$" is fully satisfied by $r$
itself, i.e. $\mu_{\tilde{A}}(r) = 1$. In addition, all its $\alpha$-cuts for $\alpha \neq 0$ must be closed intervals so
that the arithmetic operations on fuzzy sets can be defined in terms of operations on
closed intervals. On the other hand, if all its $\alpha$-cuts are closed intervals, it follows
that the fuzzy number is a convex fuzzy set.

In possibility theory (Zadeh, 1978), a branch of fuzzy set theory, fuzzy numbers are 
described by possibility distributions. 

A possibility distribution $\pi_{\tilde{A}}(\omega)$ is a function which satisfies the following conditions (Tanaka and Guo, 1999): i) there exists an $\omega$ such that $\pi_{\tilde{A}}(\omega) = 1$ (normality);
ii) $\alpha$-cuts of fuzzy numbers are convex; iii) $\pi_{\tilde{A}}(\omega)$ is piecewise continuous.

Particular fuzzy numbers are the symmetrical fuzzy numbers, whose possibility distribution may be denoted as:

$$\pi_{\tilde{A}}(\omega) = \max\left\{0,\; 1 - \left|\frac{\omega - c}{w}\right|^{q}\right\} \qquad (3)$$

where $c$ and $w$ denote the center and the spread. Specifically, (3) corresponds to triangular fuzzy numbers when $q = 1$, to square root
fuzzy numbers when $q = 1/2$ and to parabolic fuzzy numbers when $q = 2$. It is easy to
show that (3) corresponds to intervals when $q = +\infty$.

It is worth noticing that fuzzy variables are associated with possibility distributions
in a similar way that random variables are associated with probability distributions.
Furthermore, possibility distributions are numerically equal to membership functions
(Zadeh, 1978).

In the early 80's, Tanaka proposed the first fuzzy linear regression model, moving
on from fuzzy sets theory and possibility theory (Tanaka et al., 1980). The functional
relation between dependent and independent variables is represented as a fuzzy linear
function whose parameters are given by fuzzy numbers. Tanaka proposed the first
Fuzzy Possibilistic Regression (FPR) using the following fuzzy linear model with
crisp input and fuzzy parameters:

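$$\tilde{y}_n = \tilde{\beta}_0 + \tilde{\beta}_1 x_{n1} + \ldots + \tilde{\beta}_p x_{np} + \ldots + \tilde{\beta}_P x_{nP} \qquad (4)$$

where the parameters are symmetric triangular fuzzy numbers denoted by $\tilde{\beta}_p = (c_p; w_p)$, with $c_p$ and $w_p$ as the center and the spread, respectively.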

Differently from statistical regression, the deviations between data and linear models 
are assumed to depend on the vagueness of the parameters and not on measurement 
errors. The basic idea of Tanaka's approach was to minimize the uncertainty of the
estimates, by minimizing the total spread of the fuzzy coefficients. Spread minimization must be pursued under the constraint of the inclusion of the whole given data
set, which satisfies a degree of belief $\alpha$ ($0 \le \alpha < 1$) defined by the decision maker.
The estimation problem is solved via a mathematical programming approach, where 
the objective function aims at minimizing the spread parameters, and the constraints 
guarantee that observed data fall inside the fuzzy interval: 

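$$\text{minimize} \quad \sum_{n=1}^{N} \sum_{p=0}^{P} w_p\, |x_{np}| \qquad (5)$$

subject to the following constraints:

$$c_0 + \sum_{p=1}^{P} c_p x_{np} + (1 - \alpha)\left(w_0 + \sum_{p=1}^{P} w_p |x_{np}|\right) \ge y_n$$

$$c_0 + \sum_{p=1}^{P} c_p x_{np} - (1 - \alpha)\left(w_0 + \sum_{p=1}^{P} w_p |x_{np}|\right) \le y_n$$

where $x_{n0} = 1$ ($n = 1, \ldots, N$), $w_p \ge 0$ and $c_p \in \mathbb{R}$ ($p = 0, \ldots, P$).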
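The structure of the programming problem can be illustrated with a small sketch that only evaluates the objective and checks the inclusion constraints for a candidate solution; it does not solve the linear program. The function names, the data and the candidate centers and spreads below are hypothetical and chosen purely for illustration.

```python
# Minimal sketch (not a solver): evaluate the total spread and check the
# inclusion constraints of fuzzy possibilistic regression for given
# centers c = [c_0, ..., c_P] and spreads w = [w_0, ..., w_P].
# All names and numbers here are illustrative assumptions.

def total_spread(X, w):
    """Objective: sum over n and p of w_p * |x_np|, with x_n0 = 1."""
    return sum(w[0] + sum(wp * abs(x) for wp, x in zip(w[1:], row)) for row in X)


def includes_data(X, y, c, w, alpha=0.0, tol=1e-9):
    """True if every y_n lies inside the fuzzy band at belief level alpha."""
    for row, yn in zip(X, y):
        center = c[0] + sum(cp * x for cp, x in zip(c[1:], row))
        spread = w[0] + sum(wp * abs(x) for wp, x in zip(w[1:], row))
        half_width = (1 - alpha) * spread
        if not (center - half_width - tol <= yn <= center + half_width + tol):
            return False
    return True


# Hypothetical data: one crisp predictor, three observations.
X = [[1.0], [2.0], [3.0]]
y = [2.1, 3.9, 6.15]
c = [0.0, 2.0]   # candidate centers
w = [0.2, 0.0]   # candidate spreads

print(total_spread(X, w))
print(includes_data(X, y, c, w, alpha=0.0))
print(includes_data(X, y, c, w, alpha=0.5))
```

With the same spreads, a larger degree of belief narrows the band by the factor (1 - alpha), so a candidate that is feasible at alpha = 0 may violate the constraints at alpha = 0.5; the optimizer must then enlarge the spreads.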

2.2 The F-PLSPM algorithm 

The F-PLSPM follows the component based approach SEM-PLS, alternatively de- 
fined PLS Path Modeling (PLS-PM) (Tenenhaus et al., 2005). The reason is that 
fuzzy regression and PLS path modeling share several characteristics. They are both 
soft modeling and data oriented approaches. 

Specifically, fuzzy regression joins PLS-PM in its final step, allowing for a fuzzy
structural model (see Figure 1) while keeping a crisp measurement model. This connection
implies a two-stage estimation procedure:

• stage 1: latent variables are estimated according to the PLS-PM estimation procedure (Wold, 1982);

Fig. 1. Fuzzy path model representation



• stage 2: FPR on the estimated latent variables is performed so that the following
fuzzy structural model is obtained:

$$\tilde{\xi}_h = \tilde{\beta}_{h0} + \sum_{h'} \tilde{\beta}_{hh'}\, \xi_{h'} \qquad (6)$$

where $\tilde{\beta}_{hh'}$ refers to the generic fuzzy path coefficient, $\xi_h$ and $\xi_{h'}$ are adjacent
latent variables and $h, h' \in [1, \ldots, H]$ vary according to the model complexity.

It is worth noticing that the structural model obtained from this procedure differs from
the traditional structural model. Here the path coefficients are fuzzy numbers
and there is no error term, as a natural consequence of an FPR. In the analysis of
a statistical model one should always, in one way or another, take into account the
goodness of fit, above all when comparing different models. The proposal is then to use
the FPR. The estimation of fuzzy parameters, instead of single-valued (crisp) parameters, permits us to gather both the structural and the residual information. Embedding
the residual in the model via fuzzy parameters (Tanaka and Guo,
1999) makes it possible to evaluate the differences between assessors (panel performance) as
well as the reproducibility of each assessor (assessor performance) (Romano and
Palumbo, 2006b).



3 Application 

The data set comes from sensory profiling of 14 cheese samples by a panel of 12 
assessors on the basis of twelve attributes in two replicates. 

The final data matrix consists of 336 rows (12 assessors × 14 samples × 2 replicates) and 12 columns (attributes: intensity odour, acidic odour, sun odour, rancid
odour, intensity flavour, acidic flavour, sweet flavour, salty flavour, bitter flavour, sun
flavour, metallic flavour, rancid flavour). Two blocks of variables describe the latent
variables odour and flavour. First, the hierarchical PLS model proposed by Tenenhaus and Vinzi (2005) will be used to estimate a global model after averaging over
the assessors and the replicates (see Figure 2), thus collapsing the data structure
into a two-way table (samples × attributes). Then fuzzy PLS path modeling will
provide two sets of synthesized assessments: the overall latent scores for each product and the partial latent scores for the different blocks of attributes. The synthesis of
scores into a global assessment makes it possible to investigate differences between products.
However, in such a way, we lose all the information on the individual differences
between assessors. To this end, as many path models as assessors will be considered
and compared in terms of fuzzy path coefficients so as to detect possible heterogeneity in the panel. Figure 2 shows the global path model. As can be seen, the latent
variable global depends on the two latent variables odour and flavour. The F-PLSPM
algorithm is used to estimate the fuzzy path coefficients ($\tilde{\beta}_1$ and $\tilde{\beta}_2$). Crisp path coefficients in Table 1 show that the global quality of the products mostly depends on
the flavour rather than on the odour. Furthermore, the fuzzy path coefficients reveal a
worse panel performance for the flavour, as shown by its more imprecise estimate
(wider fuzzy interval). Therefore, the F-PLSPM algorithm enriches the results of
the classical PLSPM crisp approach by providing information on the imprecision of
path coefficients. At the same time, the coherence of the results is guaranteed, as the crisp
estimates lie within the fuzzy intervals.



Table 1. Global Model Path Coefficients

Latent Variable    Crisp path coefficient    Fuzzy path coefficient
Odour              0.4215                    [0.3952:0.4517]
Flavour            0.6283                    [0.6043:0.7817]
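The reading of Table 1 can be checked with a few lines of arithmetic. The sketch below takes the interval bounds from the table, computes center and spread for each fuzzy coefficient, and verifies that the crisp estimate lies inside the interval; the dictionary layout and function name are illustrative choices, not part of the F-PLSPM algorithm.

```python
# Values copied from Table 1; the data layout is an illustrative assumption.
table1 = {
    "Odour": {"crisp": 0.4215, "fuzzy": (0.3952, 0.4517)},
    "Flavour": {"crisp": 0.6283, "fuzzy": (0.6043, 0.7817)},
}


def center_and_spread(lower, upper):
    """Midpoint and half-width of an interval-valued coefficient."""
    return (lower + upper) / 2, (upper - lower) / 2


summary = {}
for name, row in table1.items():
    lower, upper = row["fuzzy"]
    center, spread = center_and_spread(lower, upper)
    summary[name] = {
        "center": center,
        "spread": spread,
        "contains_crisp": lower <= row["crisp"] <= upper,
    }
    print(name, round(center, 5), round(spread, 5), summary[name]["contains_crisp"])
```

The spread for flavour is roughly three times that for odour, which is the numerical counterpart of the "wider fuzzy interval" mentioned in the text.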



The most interesting result coming from the proposed approach is in Figure 3, which
compares the interval-valued estimates for the different assessors.

Fig. 3. Local fuzzy path coefficients

Figure 3 reports the fuzzy path coefficients for the 12 local models, one for
each assessor. By looking within each plot (flavour and odour) separately, the assessor performance and the coherence between assessors can be evaluated: a) the wider
the interval, the less consistent is the assessor; b) the closer the intervals are to each
other, the more coherent are the assessors. In the example, for the odour, assessor
7 is the least consistent assessor while assessor 12, being positioned far away from
the rest of the assessors, is the least coherent as compared to the panel. Finally, by
comparing the two plots, differences in the way each assessor perceives flavour and
odour may be detected; for instance, assessor 7 is the most imprecise for the odour
while being extremely consistent for the flavour; assessor 12 is similarly consistent for
both flavour and odour but, in both cases, is in clear disagreement with the panel
(a much higher influence of the odour as opposed to a much lower influence of the
flavour).



4 Conclusion 

The joint use of PLS component-based approach to structural equation modeling 
and fuzzy possibilistic regression has yielded promising results in the framework of 
sensory data analysis. Namely, while taking into account the multi-block feature of 
sensory data, the proposed Fuzzy-PLSPM leads to a fuzzy estimation of the path 
coefficients. Such an estimation provides information on the precision of the classi- 
cal estimates and allows a thorough comparison of the sensory evaluations between 
assessors and within assessors for different products. Future directions of research 
aim to extend the fuzzy approach also to the measurement model by introducing 
an appropriate fuzzy possibilistic regression in the external estimation phase of the 
PLSPM algorithm. This further development has a twofold interest: allowing for 
fuzzy input data; yielding fuzzy estimates of the loadings, of the outer weights and, 
as a consequence, of the latent variable scores, thus embedding the measurement 
error that naturally affects sensory assessments.



Abstract. The Dewey Decimal Classification (DDC) was conceived by Melvil Dewey in
1873 and published in 1876. Nowadays, the DDC serves as a library classification system in
about 138 countries worldwide. Recently, the German translation of the DDC was launched,
and since then the interest in DDC has rapidly increased in German-speaking countries. The
complex DDC system (Ed. 22) makes it possible to synthesize (to build) a huge number of DDC notations (numbers) with the aid of instructions. Since the meaning of built DDC numbers is
not obvious, especially to non-DDC experts, a computer program has been written that automatically analyzes DDC numbers. Based on Songqiao Liu's dissertation (Liu (1993)), our
program decomposes DDC notations from the main class 700 (as one of the ten main classes).
In addition, our program analyzes notations from all ten classes and determines the meaning
of every semantic atom contained in a built DDC notation. The extracted DDC atoms can be
used for information retrieval, automatic classification, or other purposes.



1 Introduction 

While searching for books, journals, or web resources, you will often come across 
numbers such as "025.1740973", "016.02092", or "720.7073". What do they mean? 
Library professionals will identify these strings as numbers (notations) of the
Dewey Decimal Classification (DDC), which is named after its creator, Melvil Dewey.
Originally, Dewey designed the classification for libraries, but in the meantime DDC 
has also been discovered for classifying the web or other resources. The DDC is used, 
among others, because it has a long-standing tradition and is still up to date: in order 
to cope with scientific progress, it is currently under development by a ten-member 
international board (the Editorial Policy Committee, EPC). While the first edition, 
which was published in 1876, only comprised a few pages, the current 22nd edition 
of the DDC spans a four-volume work with almost 4,000 pages. Today, the DDC 
contains approx. 48,000 DDC notations and about 8,000 instructions. The DDC no- 
tations are enumerated in the schedules and tables of the DDC. With the aid of the 
instructions mentioned above, human classifiers can build new synthesized notations 
(numbers) if these are not specifically listed in the DDC schedules. This way, an
enormous number of synthesized DDC notations has been built intellectually over
the last 130 years. These mostly unused notations are contained in library catalogues
like a hidden treasure. They can be considered as belonging to the "Deep Lib", one
of the subsets of the "Deep Web" (Bergman (2001)). Can these notations be made
accessible for information retrieval purposes with reasonable effort?

Our answer to this question consists in the automatic analysis of notations of the 
DDC. The analysis program we have developed determines all DDC notations (to- 
gether with their corresponding captions) contained in a synthesized (built) DDC 
notation. Before we go into details of the automatic analysis of DDC notations in 
section 3, section 2 provides the basis for the analysis. In section 4, the results are 
presented, and section 5 draws a conclusion. 



2 DDC notations 

Notations play an important role in the DDC: 

"Notation is the system of symbols used to represent the classes in a classifica- 
tion system. ... The notation provides a universal language to identify the class and 
related classes, regardless of the fact that different words or languages may be used 
to describe the class." (http://www.oclc.org/dewey/versions/ddc22/intro.pdf) 

The following picture serves as an example for the aforesaid. Class C is rep- 
resented by the notation 025.43 or, respectively, by the captions of three different 
languages: 




In compliance with the DDC system, the automatic analysis of notations of the
DDC is carried out in the VZG (VerbundZentrale des Gemeinsamen Bibliotheksverbundes) project Colibri (COntext generation and LInguistic tools for BIbliographic
Retrieval Interfaces). The goal of this project is to enrich title records on the basis of
the DDC to improve retrieval. The analysis of DDC notations is conducted under the
following research questions (which are also posed in a similar way in Liu (1993),
p. 18): Q1. Is it possible to automatically decompose molecular DDC notations into
atomic DDC notations? Q2. Is it possible to improve automatic classification and
retrieval by means of atomic DDC notations? An atomic DDC notation is a semantically indecomposable string (of symbols) that represents a DDC class. A molecular
DDC notation is a string that is syntactically decomposable into atomic DDC notations.

DDC notations can be found at several places in the DDC. In DDC summaries, 
the notations for the main classes (or tens), the divisions (or hundreds), and the 
sections (or thousands) are enumerated. Other notations are listed in the schedules 
("DDC schedule notations") or tables ("DDC table notations") or internal tables. 
DDC schedules are "the series of DDC numbers 000-999, their headings (captions),
and notes." (Mitchell (1996), p. lxv). A DDC table is "a table of numbers that may be
added to other numbers to make a class number appropriately specific to the work being classified" (Mitchell (1996), p. lxv). Further notations are contained in the "Relative Index" of the DDC. The frequency distributions of schedule (table) notations
are shown in Fig. 2 (Fig. 3), where schedno0 is shorthand for DDC schedule notations beginning with 0, schedno1 for DDC schedule notations beginning with 1, etc.
The captions for the main classes are: 000: Computer science, information & general works; 100: Philosophy & psychology; 200: Religion; 300: Social sciences; 400:
Language; 500: Science; 600: Technology; 700: Arts & recreation; 800: Literature;
900: History & geography. As illustrated by Fig. 2, DDC notations are not distributed
uniformly: most schedule notations can be found in the class "Technology", followed by the notations in the class "Social sciences". The fewest notations belong
to the class "Philosophy & psychology". With regard to the table notations (Fig. 3), 
the 7,816 Table 2 notations ("Geographic Areas, Historical Periods, Persons") stand 
out, whereas, in contrast, the quantities of all other table notations are comparatively 
small (Table 1: Standard Subdivisions; Table 3: Subdivisions for the Arts, for Indi- 
vidual Literatures, for Specific Literary Forms; Table 4: Subdivisions of Individual 
Languages and Language Families; Table 5: Ethnic and National Groups; Table 6: 
Languages). 

As mentioned before, DDC notations that are not explicitly listed in the schedules 
can be built by using DDC instructions. This process is called "notational synthesis" 
or "number building". Its results are synthesized DDC notations (molecular DDC 
notations) that usually only DDC experts are able to interpret. But with the aid of 
our computer program "DDC analyzer", the meaning of molecular DDC notations 
is revealed and the determined atomic DDC notations can be used, among others, to 
answer question Q2. 



3 Automatic analysis of DDC notations 

The GBV Union Catalog GVK (Gemeinsamer VerbundKatalog, http://gso.gbv.de/) contains 3,073,423 intellectually DDC-classified title records (status: July
2004). After the automatic elimination of segmentation marks, obviously incorrect
DDC notations (3.8 per cent of all DDC notations), and duplicate DDC notations, a
total of 466,134 different DDC notations is available for the automatic analysis of
DDC notations. This set of all GVK DDC notations serves as input data for the DDC
analyzer. The frequency of DDC schedule notations is as follows (in descending
order): those beginning with 3 (189,246), with 9 (62,115), with 7 (52,632), with 6
(51,704), with 5 (33,649), with 0 (23,946), with 2 (20,888), with 8 (20,678), with 4
(6,680), and with 1 (4,596). The arity of the GVK DDC notations
is approximately Gaussian distributed with a maximum at 10, i.e. most DDC notations have an arity of approximately 10;
the shortest DDC notation has arity 1, the longest DDC notation has arity
29. Other important input data for the DDC analyzer were the 600 DDC
numbers given in Liu's dissertation. These 600 DDC numbers, which we call "Liu's
sample", were randomly selected by Liu from class 700 of the OCLC database.

Fig. 2. Frequency distribution of DDC schedule notations (categories schedno0 to schedno9)

Fig. 3. Frequency distribution of DDC table notations (categories tabno1 to tabno6)

As a member of the Consortium DDC German, we have access to the machine-readable data of the 22nd edition of the DDC system. These data are stored in an xml
file. The English electronic web version is available as WebDewey
(http://connexion.oclc.org/), the German counterpart as MelvilClass (http://services.ddc-deutsch.de/melvilclass-login). For our purpose, only the relevant data of the xml file,
which contains the expert knowledge of the DDC system, are extracted and stored
in a "knowledge base". Here, DDC notations, descriptors, and descriptor values are
stored in consecutive fields, while facts and rules, as we call them, are represented
in a very similar way:

T1-093-T1-099-n021#<ba4r2>#Statistics
025.17#<na1r1>##025.17#025.341-025.349#025.34#####
025.344#<hat>#Electronic resources

The three example lines of the knowledge base should be read as follows. Fact:
T1-093-T1-099H-021 has the caption "Statistics". Rule: Add to base number 025.17
the numbers following 025.34 in 025.341-025.349. Fact: 025.344 has the caption
"Electronic resources". The character '#' serves as field separator. The xml tags that are given in
angle brackets stand for: "ba4" ("beginning of add table (all of table number)"), "na1"
("add note (part of schedule number)") and "hat" ("hierarchy at class"); "r1" and "r2",
which follow "na1" or, respectively, "ba4", stand for the first two macro rules. The
knowledge base contains 48,067 facts and 8,033 rules. The 8,033 rules can be generalized to macro rules. While Liu (1993) defined 17 (macro) rules for the decomposition for class 700, we defined 25 macro rules for all DDC classes.
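As an illustration of the record layout described above, the following sketch splits a knowledge-base line on the '#' separator and applies the quoted add rule to one notation. The function names and the returned dictionary are hypothetical, the tag is read as "na1r1" following the explanation in the text, and the rule application is a deliberate simplification (it does not check that the result falls in the span 025.341-025.349).

```python
import re

# A tag such as "na1r1" is read as descriptor "na1" plus macro rule 1.
TAG_PATTERN = re.compile(r"([a-z]+\d)r(\d+)")


def parse_kb_line(line):
    """Split a knowledge-base line into notation, tag, macro rule and values."""
    fields = line.split("#")
    tag = fields[1].strip("<>")
    match = TAG_PATTERN.fullmatch(tag)
    return {
        "notation": fields[0],
        "tag": match.group(1) if match else tag,
        "macro_rule": int(match.group(2)) if match else None,
        "values": [f for f in fields[2:] if f],  # drop empty fields
    }


def apply_add_rule(notation, base, follow):
    """Map the digits added to `base` onto the notation they were taken from."""
    if notation.startswith(base) and len(notation) > len(base):
        return follow + notation[len(base):]
    return None


rule = parse_kb_line("025.17#<na1r1>##025.17#025.341-025.349#025.34#####")
fact = parse_kb_line("025.344#<hat>#Electronic resources")
source = apply_add_rule("025.174", base="025.17", follow="025.34")
print(rule["tag"], rule["macro_rule"], rule["values"])
print(source, fact["values"][0] if source == fact["notation"] else None)
```

Under these assumptions, the built number 025.174 is traced back to 025.344, whose caption "Electronic resources" is then available as one of the atoms of the molecular notation.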

Our program, the DDC analyzer, works as follows: after initializing variables, it
reads the knowledge base and, triggered by one or more DDC notations to be analyzed, executes the analysis algorithm. The number of correct and incorrect DDC
notations is counted. For each DDC notation, the analysis proceeds in two phases: determining the facts from left to right (phase 1) and determining
the facts via rules from left to right (phase 2). After checking which output format is required, the result is printed as a DDC analysis diagram or as a DDC
analysis result set. After all DDC notations have been analyzed, the number of totally/partially analyzed DDC notations is printed. There are different reasons for a
partially analyzed DDC notation: either the implementation of the DDC analyzer is
incorrect/incomplete, or the DDC notation is incorrectly synthesized, or a part of the
DDC system itself is incorrect.



4 Results

To demonstrate our progress in comparison with Liu’s work, we compare his decom- 
position result with our DDC analysis diagram for the 37th molecular DDC notation 
of his sample: 

Liu (1993), pp. 99-100 
720.7073 has been decomposed as follows: 

720: Architecture 

0707: Geographical treatment 

73: United States 

The title of this book is: 

#a Voices in architectural education: #bcultural politics and 
The subject headings for this book are: 

#aArchitecture #xStudy and teaching #zUnited States. 

#aArchitecture and state #zUnited States. 

Reiner (2007a), p. 49 

720.7073 <liu_37_to_analyze; length: 8>

7--.----  Arts & recreation <hatzen>

72-.----  Architecture <hatzen>

720.----  Architecture <hat>

--0.7---  Education, research, related topics <T1-07>

--0.707-  Geographic treatment <T1-0707>

---.--7-  North America <na4r7span:T1-0701-T1-0709:T2-7>

---.--73  United States <na4r7span:T1-0701-T1-0709:T2-73>

The information given in angle brackets should be read as follows: "hatzen" is the
concatenation of "hat" ("hierarchy at class") and "zen" ("zen built entry (main tag)").
"T1-" stands for "table 1", "T2-" for "table 2", "na4" for "add note (add of table
number)", "r7" for "macro rule 7", "span" for "span of numbers", and ":" for "delimiter". As you can see, while Liu decomposes the synthesized DDC notation into three
chunks, our DDC analysis diagram shows the finest possible analysis of the molecular DDC notation. The fine analysis provides the advantage of uncovering additional
captions: "Arts & recreation", "Architecture", "North America", and "Education, research, related topics".
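The first three rows of the diagram correspond to facts found by reading the notation from left to right. A minimal sketch of such a prefix lookup is given below; the small dictionary stands in for the knowledge base, and the function names are hypothetical. It covers only the hierarchy facts, not the rule-based second phase.

```python
# Tiny stand-in for the fact part of the knowledge base (captions from the text).
FACTS = {
    "7": "Arts & recreation",
    "72": "Architecture",
    "720": "Architecture",
}


def with_dewey_dot(digits):
    """Re-insert the Dewey dot after the third digit, if there is one."""
    return digits if len(digits) <= 3 else digits[:3] + "." + digits[3:]


def leading_facts(notation, facts):
    """Collect every prefix of the notation that is listed as a fact."""
    digits = notation.replace(".", "")
    found = []
    for end in range(1, len(digits) + 1):
        candidate = with_dewey_dot(digits[:end])
        if candidate in facts:
            found.append((candidate, facts[candidate]))
    return found


for notation, caption in leading_facts("720.7073", FACTS):
    print(notation, caption)
```

The remaining digits (".7073" in this example) are not listed as schedule facts and would have to be resolved through add rules and table notations.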

A DDC analysis diagram contains analysis and synthesis information: 1. the
molecular DDC notation to be analyzed; 2. an identifier (name) and the length of
the molecular DDC notation; 3. the sequence and position of the digits within the
molecular DDC notation; 4. the Dewey dot at position 4; 5. the relevant parts of
the molecular DDC notation for each analysis step; 6. the corresponding caption for
every atomic DDC notation; 7. the parts irrelevant for the respective analysis step,
marked with "-"; 8. the type of the applied facts and rules, which appear in angle brackets. Once it has been explained how to read the information mentioned in item 8,
every synthesis step can be reproduced. While DDC analysis diagrams are intended
for human experts, the DDC analysis result set can be used for data transfer. Currently, we distinguish three kinds of analysis result sets. The first one is a set of DDC
<notation;caption> tuples:

7;Arts & recreation
72;Architecture
720;Architecture
T1-07;Education, research, related topics
T1-0707;Geographic treatment
T2-7;North America
T2-73;United States

The second one delivers all DDC notations contained in a synthesized number:

liu_37:720.7073;7;72;720;T1-07;T1-0707;T2-7;T2-73

The third analysis result set is in MAB2 format:

705a^a720.7073^p72^cT1-070^f0707^g73
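The first two result-set formats are simple serializations of the same list of atoms. The sketch below shows one way they could be produced from a list of (notation, caption) pairs; the function names are hypothetical and the MAB2 variant is omitted because its field mapping is not spelled out in the text.

```python
# Atoms of 720.7073 as listed in the first result set.
ANALYSIS = [
    ("7", "Arts & recreation"),
    ("72", "Architecture"),
    ("720", "Architecture"),
    ("T1-07", "Education, research, related topics"),
    ("T1-0707", "Geographic treatment"),
    ("T2-7", "North America"),
    ("T2-73", "United States"),
]


def tuple_result_set(analysis):
    """First format: one 'notation;caption' line per atom."""
    return [f"{notation};{caption}" for notation, caption in analysis]


def notation_result_set(identifier, molecular, analysis):
    """Second format: identifier, molecular notation, then all atoms."""
    atoms = ";".join(notation for notation, _ in analysis)
    return f"{identifier}:{molecular};{atoms}"


print("\n".join(tuple_result_set(ANALYSIS)))
print(notation_result_set("liu_37", "720.7073", ANALYSIS))
```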

All 600 analyzed DDC notations of Liu’s sample have been compared accordingly 
with the results of Liu (1993). It turns out that Liu’s decompositions can be repro- 
duced. Minor differences result from printing errors in his dissertation and the usage 
of different (20th/22nd) DDC editions. After 14 years, 36 DDC notations of Liu’s 
sample are out of date because of relocations and discontinuations. As far as the 
analysis of the 466,134 GVK DDC notations of all DDC classes is concerned, cur- 
rently 297,782 (168,352) DDC notations can be totally (partially) analyzed, i.e. 63.9 
per cent (36.1 per cent) are totally (partially) analyzed. In some DDC classes, the 
analyzing degree is even higher, which means that, e.g., 87 per cent of the 51,704 
DDC notations of the class "Technology" (600) can be totally analyzed. 



5 Conclusion 

In 1993, Liu showed that DDC synthesized class numbers of main class 700 can 
be decomposed automatically. Our program analyzes notations from all ten main 
classes. Compared to Liu’s approach, our analysis procedure delivers more infor- 
mation, which is furthermore presented in a new way. Since Liu’s expert-evaluated 
results are reproduced, we can (statistically) infer that our DDC analyzer works cor- 
rectly with high probability. Increasing the quantity of DDC notations totally ana- 
lyzed will be the next step. The results can be used to improve (multilingual) DDC 
information retrieval or DDC automatic classification systems. On the basis of analy- 
sis diagrams, DDC tutorials or expert systems could be developed to support teaching 
of DDC number building or to control the quality of built DDC numbers.



Abstract. Interval data allow statistical units to be described by means of interval values
when their representation by single values appears to be too reductive or inconsistent, that
is, unable to preserve the uncertainty usually inherent in the observed data. In the present paper,
we present a novel distance for interval data based on the Wasserstein distance between distributions. We show its interesting properties within the context of clustering techniques. We
compare the results obtained using the dynamic clustering algorithm, taking into consideration
different distance measures in order to justify the novelty of our proposal.*



1 Introduction 

The representation of data by means of intervals of values is becoming more and 
more frequent in different fields of application. In general, an interval description 
depends on the uncertainty that affects the observed values of a phenomenon. The 
uncertainty can be considered as the inability to obtain true values depending on ig- 
norance of the model that regulates the phenomenon. This uncertainty can be of three 
types: randomness, vagueness or imprecision (Coppi et al., 2006). Randomness is 
present when it is possible to hypothesize a probability distribution of the outcomes 
of an experiment, or when the observation is affected by an error component that is
modeled as a random variable (e.g., Gaussian white noise). Vagueness arises when a fact is unclear or when it is uncertain whether a concept applies at all. Imprecision is
related to the difficulty of accurately measuring a phenomenon. While randomness 
is strictly related to a probabilistic approach, vagueness and imprecision have been 
widely treated using fuzzy set theory, as well as the interval algebra approach. Prob- 
abilistic, fuzzy and interval algebra sometimes overlap in treating interval data. In 
the literature, interval algebra and fuzzy theory are treated very closely, especially in 
defining dissimilarity measures for the comparison of values affected by uncertainty 
expressed by intervals. 

Interval data have also been studied in Symbolic Data Analysis (SDA) (Bock and
Diday (2000)), a new domain related to multivariate analysis, pattern recognition and
artificial intelligence. In this framework, in order to take into account the variability
and/or the uncertainty inherent in the data, the description of a single unit can assume
multiple values (bounded sets of real values, multi-categories, weight distributions),
where intervals are a particular case. In SDA, several dissimilarity measures have
been proposed. Chavent and Lechevallier (2002) and Chavent et al. (2006) proposed
Hausdorff $L_1$ distances, while De Carvalho et al. (2006) proposed $L_q$ distances and
De Souza et al. (2004) an adaptive $L_2$ version. It is worth noting that these measures
are based essentially on the boundary values of the compared intervals. These distances have been mainly proposed as criterion functions in clustering algorithms to
partition a set of interval data where a cluster structure can be assumed.

* The present paper has been supported by the LC^ Italian research project.

In the present paper, we present some dissimilarity (or distance) functions proposed in fuzzy, symbolic data analysis and probabilistic contexts to compare intervals
of real values. Finally, we introduce a new metric based on the Wasserstein distance
that respects all the classical properties of a distance and, being based on the quantile
functions associated with the interval distributions, seems particularly able to keep
the whole information contained in the intervals and not only that on the bounds.

The structure of the paper is as follows: in Section 2, some families of distances
for interval data, arising from different contexts, are shown. In Section 3, the new distance, based on a probabilistic metric, namely the $L_2$ version of the Monge-Kantorovich-Wasserstein-Gini metric between quantile functions (usually known as the Wasserstein
metric), is introduced. In Section 4, the performance of the proposed distance is
evaluated in a clustering process on a set of interval data. The obtained partition is
compared to an expert one. When the clustering is also performed with the adaptive $L_2$
and Hausdorff $L_1$ distances, the proposed distance again provides the best results
in terms of the Corrected Rand Index. Section 5 closes the paper with some remarks
and perspectives.



2 A brief survey of the existing distances 

According to Symbolic Data Analysis, an interval variable is a correspondence
between a set $E$ of units and a set of closed intervals $[a, b]$, where $a \le b$ and $a, b \in \mathbb{R}$.
Without loss of generality, the notation is essentially the same for the interval algebra
approach.

Let $A$ and $B$ be two intervals described, respectively, by $[a, b]$ and $[u, v]$. $d(A, B)$
can be considered as a distance if the main properties that define a distance are
satisfied: $d(A, A) = 0$ (reflexivity), $d(A, B) = d(B, A)$ (symmetry) and $d(A, B) \le
d(A, C) + d(C, B)$ (triangular inequality).

Hereinafter, we present some of the most used distances for interval data. The
main properties of such measures are also highlighted.

Tran and Duckstein distance between intervals. Some of these distances have been 
developed within the framework of the fuzzy approach. One of these is the distance 
dehned by Tran and Duckstein (2002) that has the following formulation: 
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1/2 1/2 , 
(1td{A,B)= f f {[{‘i^)+x{b-a)\-[{‘i^)+y{v-u)\) dxdy = 
- t / 2 - 1/2 



( 1 ) 









In practice, they consider the expected value of the squared distance between all the points
belonging to interval A and all the points belonging to interval B. In their paper,
they state that it is a distance, but it is easy to observe that it does not
satisfy the first of the properties mentioned above. Indeed, the distance of an interval from
itself is equal to zero only if the interval is thin:

\[ d_{TD}^2(A,A) = \left[\frac{a+b}{2} - \frac{a+b}{2}\right]^2 + \frac{1}{3}\left[\left(\frac{b-a}{2}\right)^2 + \left(\frac{b-a}{2}\right)^2\right] = \frac{2}{3}\left(\frac{b-a}{2}\right)^2 \ge 0 \tag{2} \]
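
The failure of reflexivity can be checked numerically. The following Python sketch is only illustrative: it assumes the closed form of the double integral, namely the squared midpoint difference plus one third of the sum of the squared half-widths, and compares it with a crude midpoint-rule evaluation of the integral over $[-1/2, 1/2]^2$. The function names are ours and do not come from the paper.

```python
def tran_duckstein_sq(a, b, u, v):
    """Assumed closed form of the squared Tran-Duckstein distance between [a, b] and [u, v]."""
    mid_gap = (a + b) / 2 - (u + v) / 2
    return mid_gap ** 2 + (((b - a) / 2) ** 2 + ((v - u) / 2) ** 2) / 3


def tran_duckstein_sq_grid(a, b, u, v, n=200):
    """Midpoint-rule approximation of the double integral over x, y in [-1/2, 1/2]."""
    mid_a, mid_b = (a + b) / 2, (u + v) / 2
    total = 0.0
    for i in range(n):
        x = -0.5 + (i + 0.5) / n
        for j in range(n):
            y = -0.5 + (j + 0.5) / n
            total += ((mid_a + x * (b - a)) - (mid_b + y * (v - u))) ** 2
    return total / (n * n)


thin = tran_duckstein_sq(2, 2, 2, 2)       # thin interval compared with itself
proper = tran_duckstein_sq(0, 6, 0, 6)     # proper interval compared with itself
approx = tran_duckstein_sq_grid(0, 6, 0, 6)
print(thin, proper, approx)
```

For the thin interval the value is zero, whereas for $[0, 6]$ compared with itself it is strictly positive, which is the behaviour described above.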



Hausdorff-based distances. The most common distance used for the comparison of
two sets is the Hausdorff distance¹. Considering two sets $A$ and $B$ of points of $\mathbb{R}^n$,
and a distance $d(x,y)$ where $x \in A$ and $y \in B$, the Hausdorff distance is defined as
follows:

\[ d_H(A,B) = \max\left( \sup_{x \in A} \inf_{y \in B} d(x,y),\; \sup_{y \in B} \inf_{x \in A} d(x,y) \right) \tag{3} \]

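If $d(x,y)$ is the $L_1$ city-block distance, then Chavent et al. (2002) proved that

\[ d_H(A,B) = \max(|a-u|, |b-v|) = \left|\frac{a+b}{2} - \frac{u+v}{2}\right| + \left|\frac{b-a}{2} - \frac{v-u}{2}\right| \tag{4} \]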
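
The equivalence between the bounds form and the midpoint-radius form of the Hausdorff distance can be illustrated with a short Python sketch. It assumes one-dimensional closed intervals given by their bounds; the function names are hypothetical, and the two test intervals are the January values of Amsterdam and Athens from the temperature dataset.

```python
def hausdorff_bounds(a, b, u, v):
    """Hausdorff distance between [a, b] and [u, v] written on the bounds."""
    return max(abs(a - u), abs(b - v))


def hausdorff_mid_rad(a, b, u, v):
    """Same distance written on midpoints and radii (half-widths)."""
    mid_a, rad_a = (a + b) / 2, (b - a) / 2
    mid_b, rad_b = (u + v) / 2, (v - u) / 2
    return abs(mid_a - mid_b) + abs(rad_a - rad_b)


amsterdam_jan = (-4, 4)
athens_jan = (6, 12)
print(hausdorff_bounds(*amsterdam_jan, *athens_jan))
print(hausdorff_mid_rad(*amsterdam_jan, *athens_jan))
```

Both calls give the same value because $\max(|x - y|, |x + y|) = |x| + |y|$, with $x$ the midpoint difference and $y$ the radius difference.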

An analytical formulation of this metric using the Euclidean distance has been
devised (Bock, 2005).

$L_q$ distances between the bounds of intervals. A family of distances between
intervals has been proposed by De Carvalho et al. (2006). Considering a set of interval
data described in a space $\mathbb{R}^p$, the metric of norm $q$ is defined as:

\[ d_{L_q}(A,B) = \left( \sum_{j=1}^{p} |a_j - u_j|^q + |b_j - v_j|^q \right)^{1/q}. \tag{5} \]

They also showed that if the norm is $L_\infty$, then $d_{L_\infty} = d_H$ (in $L_1$ norm).

The same measure was extended (De Carvalho (2007)) to an adaptive one in order 
to take into account the variability of the different clusters in a dynamical clustering 
process. 



3 Our proposal: Wasserstein distance 

If we suppose a uniform distribution of points, an interval of reals $A(t) = [a, b]$ can
be expressed as the following type of function:

\[ A(t) = a + t(b - a), \qquad 0 \le t \le 1. \tag{6} \]

¹ The name is related to Felix Hausdorff, who is well-known for the separability theorem on
topological spaces at the end of the 19th century.

If we consider a description of the interval by means of its midpoint $m$ and radius $r$,
the same function can be rewritten as follows:

\[ A(t) = m + r(2t - 1), \qquad 0 \le t \le 1. \tag{7} \]



Then, the squared Euclidean distance between homologous points of two intervals
$A = [a,b]$ and $B = [u,v]$, or described by the midpoint-radius notation $A = (m_A, r_A)$
and $B = (m_B, r_B)$, is defined as follows:

\[ d^2(A,B) = \int_0^1 [A(t) - B(t)]^2\,dt = \int_0^1 [(m_A - m_B) + (r_A - r_B)(2t - 1)]^2\,dt = (m_A - m_B)^2 + \frac{1}{3}(r_A - r_B)^2. \tag{8} \]
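
As a minimal sketch of this computation, the Python code below evaluates the squared distance between two intervals in midpoint-radius form and compares it with a midpoint-rule approximation of the integral over $t \in [0, 1]$. The function names and the number of quadrature nodes are our own choices, not part of the paper.

```python
def wasserstein_sq(a, b, u, v):
    """Squared distance between [a, b] and [u, v] in midpoint-radius form."""
    mid_a, rad_a = (a + b) / 2, (b - a) / 2
    mid_b, rad_b = (u + v) / 2, (v - u) / 2
    return (mid_a - mid_b) ** 2 + (rad_a - rad_b) ** 2 / 3


def wasserstein_sq_numeric(a, b, u, v, n=10_000):
    """Midpoint-rule approximation of the integral of (A(t) - B(t))^2 over [0, 1]."""
    total = 0.0
    for i in range(n):
        t = (i + 0.5) / n
        total += ((a + t * (b - a)) - (u + t * (v - u))) ** 2
    return total / n


print(wasserstein_sq(-4, 4, 6, 12))           # closed form
print(wasserstein_sq_numeric(-4, 4, 6, 12))   # numerical check
print(wasserstein_sq(0, 6, 0, 6))             # an interval compared with itself
```

Unlike the Tran and Duckstein distance, this distance is zero when any interval, thin or not, is compared with itself, because homologous points are matched.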

In this case, we assume that the points are uniformly distributed between the two
bounds. From a probabilistic point of view, this is similar to comparing two uniform
density functions $U(a,b)$ and $U(u,v)$. In this way, we may use the
Monge-Kantorovich-Wasserstein-Gini metric (Gibbs and Su (2002)). Let $\Psi$ be a distribution
function and $\Psi^{-1}$ the corresponding quantile function. Given two univariate random
variables $\psi_A$ and $\psi_B$, the Wasserstein-Kantorovich distance is defined as:

\[ d(\psi_A, \psi_B) = \int_0^1 \left| \Psi_A^{-1}(t) - \Psi_B^{-1}(t) \right| dt \tag{9} \]



In Barrio et al. (1999), the $L_2$ version (defined as Wasserstein distance) of this
distance was proposed to study the weak convergence of distributions:

\[ d_W(\psi_A, \psi_B) = \sqrt{\int_0^1 \left( \Psi_A^{-1}(t) - \Psi_B^{-1}(t) \right)^2 dt} \tag{10} \]

In our context, it is possible to prove that:

\[ d_W(U(a,b), U(u,v)) = \sqrt{(\mu_A - \mu_B)^2 + (\sigma_A - \sigma_B)^2} \tag{11} \]



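where $\mu_A = \frac{a+b}{2}$ (resp. $\mu_B = \frac{u+v}{2}$) and $\sigma_A = \sqrt{\frac{(b-a)^2}{12}}$ (resp. $\sigma_B = \sqrt{\frac{(v-u)^2}{12}}$). In general,
given two densities $\psi_A$ and $\psi_B$ with finite first two moments, $\mu_A = E(A)$ (resp.
$\mu_B = E(B)$), $\sigma_A = \sqrt{VAR(A)}$ (resp. $\sigma_B = \sqrt{VAR(B)}$) and $Corr_{QQ}$ as the correlation
of the quantiles of $\Psi_A$ and $\Psi_B$, Irpino and Romano (2007) proved that (10) can
be decomposed as:

\[ d_W^2(\psi_A, \psi_B) = (\mu_A - \mu_B)^2 + (\sigma_A - \sigma_B)^2 + 2\sigma_A\sigma_B\left[1 - Corr_{QQ}(\Psi_A, \Psi_B)\right] \tag{12} \]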



The proposed decomposition separates the contributions to the distance that arise
from differences in location, in size and in shape of the two densities.
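
As a worked check, and under the assumption that both intervals are non-degenerate ($a < b$ and $u < v$), decomposition (12) reduces to the interval formula (11) and to the midpoint-radius form (8). The quantile function of a uniform density is linear in $t$, so the two quantile functions are perfectly correlated and the shape term vanishes:

```latex
\begin{aligned}
\Psi_A^{-1}(t) &= a + t(b-a) = \mu_A + \sqrt{3}\,\sigma_A\,(2t-1),
  \qquad \sigma_A = \frac{b-a}{\sqrt{12}} = \frac{r_A}{\sqrt{3}},\\
\Psi_B^{-1}(t) &= u + t(v-u) = \mu_B + \sqrt{3}\,\sigma_B\,(2t-1),
  \qquad \sigma_B = \frac{v-u}{\sqrt{12}} = \frac{r_B}{\sqrt{3}},\\
Corr_{QQ}(\Psi_A,\Psi_B) &= 1
  \quad \text{(both are increasing linear functions of } t\text{)},\\
d_W^2\bigl(U(a,b),U(u,v)\bigr) &= (\mu_A-\mu_B)^2 + (\sigma_A-\sigma_B)^2 + 2\sigma_A\sigma_B\,(1-1)
  = (\mu_A-\mu_B)^2 + \tfrac{1}{3}\,(r_A-r_B)^2 .
\end{aligned}
```

For two intervals, therefore, only the location and size components contribute; the shape component matters when densities other than the uniform are compared.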
In order to calculate the distance between two elements described by $p$ interval
variables, we propose the following extension of the distance to the multivariate case in
the sense of Minkowski:

\[ d_W(A,B) = \sqrt{\sum_{j=1}^{p} d_W^2(A_j, B_j)} \tag{13} \]
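
A minimal Python sketch of the multivariate case follows. It assumes that the extension is the order-2 Minkowski combination of the univariate distances, that is, the square root of the sum over the $p$ variables of the squared univariate distances; the function names are ours. The example uses only the six months of Amsterdam and Athens that are listed in Table 1, not the full yearly description.

```python
import math


def wasserstein_sq(interval_a, interval_b):
    """Squared univariate distance between two intervals given as (lower, upper)."""
    (a, b), (u, v) = interval_a, interval_b
    mid_gap = (a + b) / 2 - (u + v) / 2
    rad_gap = (b - a) / 2 - (v - u) / 2
    return mid_gap ** 2 + rad_gap ** 2 / 3


def wasserstein_multivariate(x, y):
    """Assumed order-2 Minkowski extension over p interval variables."""
    return math.sqrt(sum(wasserstein_sq(xj, yj) for xj, yj in zip(x, y)))


# Jan, Feb, Mar, Oct, Nov, Dec intervals as listed in Table 1.
amsterdam = [(-4, 4), (-5, 3), (2, 12), (5, 15), (-1, 4), (-1, 4)]
athens = [(6, 12), (6, 12), (8, 16), (16, 23), (11, 18), (8, 14)]

print(wasserstein_multivariate(amsterdam, athens))
```

Summing squared univariate terms treats the descriptors as independent, which is the hypothesis the extension relies on.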



4 Dynamic clustering algorithm using different criterion functions

In this section, we present the effect of using different distances as the allocation
function for the dynamic clustering of a temperature dataset. The Dynamic Clustering
Algorithm (DCA) (Diday (1971)) represents a general reference for unsupervised,
non-hierarchical, iterative clustering algorithms. In particular, DCA simultaneously
looks for the partition of the set of data and the representation of the clusters.
The main contributions to the clustering of interval data have been presented in
the framework of symbolic data analysis, especially for defining a way to represent
the clusters by means of prototypes (Chavent et al. (2006)). In the literature, several
authors indicate how to compute prototypes. In particular, Verde and Lauro (2000)
proposed that the prototype of a cluster must be considered as an element having
the same properties as the clustered elements. In this way, a cluster of intervals
is described by a single prototypal interval, in the same way as a cluster of points is
represented by its barycenter.

Let $E$ be a set of $n$ data described by $p$ interval variables $X_j$ ($j = 1,\ldots,p$). The
general DCA looks for the partition $P \in P_k$ of $E$ in $k$ classes, among all the possible
partitions $P_k$, and the vector $L \in L_k$ of $k$ prototypes representing the classes in $P$,
such that the following fitting criterion $\Delta$ between $L$ and $P$ is minimized:

\[ \Delta(P^*, L^*) = \min\{\Delta(P,L) \mid P \in P_k,\ L \in L_k\}. \tag{14} \]

Such a criterion is defined as the sum of dissimilarity or distance measures $\delta(x_i, G_h)$
of fitting between each object $x_i$ belonging to a class $C_h \in P$ and the class
representation $G_h \in L$:

\[ \Delta(P,L) = \sum_{h=1}^{k} \sum_{x_i \in C_h} \delta(x_i, G_h). \]

A prototype $G_h$ associated to a class $C_h$ is an element of the space of the description
of $E$, and it can be represented as a vector of intervals. The algorithm is initialized
by generating $k$ random clusters or, alternatively, $k$ random prototypes. Generally,
the criterion $\Delta(P,L)$ is based on an additive distance on the $p$ descriptors.
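
The alternation between allocation and representation can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the squared Wasserstein distance summed over the variables as the fitting measure, in which case the prototype that minimizes the criterion within a cluster is the interval-wise mean of the lower and upper bounds. Initial prototypes are fixed by hand instead of drawn at random so that the run is reproducible, and the data are the January and February intervals of the eight cities listed in Table 1, clustered into two groups only.

```python
def wasserstein_sq(x, y):
    """Squared distance between two objects, each a list of (lower, upper) intervals."""
    total = 0.0
    for (a, b), (u, v) in zip(x, y):
        total += ((a + b) / 2 - (u + v) / 2) ** 2 + ((b - a) / 2 - (v - u) / 2) ** 2 / 3
    return total


def prototype(members):
    """Interval-wise mean of the bounds (minimizes the summed squared distance)."""
    n, p = len(members), len(members[0])
    return [(sum(x[j][0] for x in members) / n, sum(x[j][1] for x in members) / n)
            for j in range(p)]


def dynamic_clustering(data, initial_prototypes, max_iter=100):
    prototypes = list(initial_prototypes)
    labels = None
    for _ in range(max_iter):
        # Allocation step: assign every object to its closest prototype.
        new_labels = [min(range(len(prototypes)),
                          key=lambda h: wasserstein_sq(x, prototypes[h]))
                      for x in data]
        if new_labels == labels:
            break  # the partition is stable, so the criterion cannot decrease further
        labels = new_labels
        # Representation step: recompute the prototype of every non-empty cluster.
        for h in range(len(prototypes)):
            members = [x for x, lab in zip(data, labels) if lab == h]
            if members:
                prototypes[h] = prototype(members)
    criterion = sum(wasserstein_sq(x, prototypes[lab]) for x, lab in zip(data, labels))
    return labels, prototypes, criterion


cities = ["Amsterdam", "Athens", "Bahrain", "Bombay", "Tokyo", "Toronto", "Vienna", "Zurich"]
data = [[(-4, 4), (-5, 3)], [(6, 12), (6, 12)], [(13, 19), (14, 19)], [(19, 28), (19, 28)],
        [(0, 9), (0, 10)], [(-8, -1), (-8, -1)], [(-2, 1), (-1, 3)], [(-11, 9), (-8, 15)]]

# Hypothetical initialization: Amsterdam and Bombay as starting prototypes.
labels, prototypes, criterion = dynamic_clustering(data, [data[0], data[3]])
for city, lab in zip(cities, labels):
    print(city, lab)
```

Each of the two steps can only lower the criterion or leave it unchanged, which is why the loop stops as soon as the allocation no longer changes.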

In the present paper, we present an application based on a dynamic clustering of a
real-world data set. The data set used in our experiments is the interval temperature
dataset shown in Table 1, which was previously used as benchmark interval data
for cluster analysis in De Carvalho (2007), Guru and Kiranagi (2005) and Guru et
al. (2004). We performed a dynamic clustering using, in turn, the following allocation
functions: the Hausdorff $L_1$ distance, the $L_2$ distance of De Carvalho et al. (2006), the
De Carvalho adaptive distance (De Souza et al. (2004)) and the $L_2$ Wasserstein distance. We
chose to obtain a partition into four clusters, and we compared the resulting
partition with the a priori one given by experts using the Corrected Rand Index. The
expert classification was the following (Guru et al. (2004)): Class 1 (Bahrain, Bombay,
Cairo, Calcutta, Colombo, Dubai, Hong Kong, Kuala Lumpur, Madras, Manila,
Mexico City, Nairobi, New Delhi, Sydney); Class 2 (Amsterdam, Athens, Copenhagen,
Frankfurt, Geneva, Lisbon, London, Madrid, Moscow, Munich, New York, Paris,
Rome, San Francisco, Seoul, Stockholm, Tokyo, Toronto, Vienna, Zurich); Class 3
(Mauritius); Class 4 (Tehran).

Table 1. The temperature dataset

City       Jan      Feb      Mar      ...  Oct      Nov      Dec
Amsterdam  [-4,4]   [-5,3]   [2,12]   ...  [5,15]   [-1,4]   [-1,4]
Athens     [6,12]   [6,12]   [8,16]   ...  [16,23]  [11,18]  [8,14]
Bahrain    [13,19]  [14,19]  [17,30]  ...  [24,31]  [20,26]  [15,21]
Bombay     [19,28]  [19,28]  [22,30]  ...  [24,32]  [24,30]  [25,30]
Tokyo      [0,9]    [0,10]   [3,13]   ...  [13,21]  [8,16]   [2,12]
Toronto    [-8,-1]  [-8,-1]  [-4,4]   ...  [6,14]   [-1,17]  [-5,1]
Vienna     [-2,1]   [-1,3]   [1,8]    ...  [7,13]   [2,7]    [1,3]
Zurich     [-11,9]  [-8,15]  [-7,18]  ...  [5,23]   [0,19]   [-11,8]

Using the three different allocation functions, we obtained three optimal partitions
into four clusters (Table 2). On the basis of the dynamic clustering, we evaluated the
obtained partitions with respect to the a priori one using the Corrected Rand Index
(Hubert and Arabie (1985)).
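
To illustrate how such an index is computed, the following Python sketch implements the Corrected Rand Index from pair counts and applies it to the expert classification and to the three partitions of Table 2. The function and variable names are ours; the partitions are transcribed by hand from the text and from Table 2, with each city written as a single token, so any transcription error would change the values.

```python
from collections import Counter
from math import comb


def corrected_rand_index(labels_a, labels_b):
    """Corrected (adjusted) Rand Index between two labelings of the same objects."""
    n = len(labels_a)
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = same_a * same_b / comb(n, 2)   # agreement expected by chance
    maximum = (same_a + same_b) / 2
    return (same_both - expected) / (maximum - expected)


def to_labels(clusters, cities):
    """Map every city to the index of the cluster (space-separated names) containing it."""
    lookup = {city: h for h, members in enumerate(clusters) for city in members.split()}
    return [lookup[city] for city in cities]


cold = "Amsterdam Copenhagen Frankfurt Geneva London Moscow Munich Paris Stockholm Toronto Vienna"
mild = "Athens Lisbon Madrid NewYork Rome SanFrancisco Seoul Tehran Tokyo"

expert = [
    "Bahrain Bombay Cairo Calcutta Colombo Dubai HongKong KulaLumpur Madras Manila "
    "MexicoCity Nairobi NewDelhi Sydney",
    cold + " Zurich Athens Lisbon Madrid NewYork Rome SanFrancisco Seoul Tokyo",
    "Mauritius",
    "Tehran",
]
wasserstein = [
    "Bahrain Bombay Cairo Calcutta Colombo Dubai HongKong KulaLumpur Madras Manila NewDelhi",
    cold + " Zurich",
    "Mauritius MexicoCity Nairobi Sydney",
    mild,
]
adaptive = [
    "Bahrain Bombay Calcutta Colombo Dubai HongKong KulaLumpur Madras Manila NewDelhi",
    cold,
    "Cairo Mauritius MexicoCity Nairobi Sydney",
    mild + " Zurich",
]
hausdorff = [
    "Bahrain Dubai HongKong NewDelhi Cairo MexicoCity Nairobi",
    cold + " Zurich",
    "Bombay Calcutta Colombo KulaLumpur Madras Manila Mauritius Sydney",
    mild,
]

cities = sorted(" ".join(expert).split())
reference = to_labels(expert, cities)
for name, partition in [("Wasserstein", wasserstein), ("Adaptive", adaptive),
                        ("Hausdorff", hausdorff)]:
    print(name, round(corrected_rand_index(to_labels(partition, cities), reference), 2))
```

The index is 1 for identical partitions and close to 0 for partitions that agree no more than expected by chance; the function is undefined when both partitions are trivial, a case that does not arise here.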



5 Conclusion and perspectives 

Interval descriptions can be derived from measurements subject to error (/r±e). If 
they are regarded as (probabilistic) models for the error term, Hausdorff distances
are not influenced by the distribution of values, and the $L_q$ distance implicitly assumes that
all the information is concentrated equally on the bounds of the intervals. The Wasserstein
distance allows differences in position, variability and shape of the compared
distributions to be evaluated and taken into account separately, which eases the
interpretation of the results. With a few modifications, it can also be used for the comparison
of two fuzzy numbers measured by LR fuzzy variables. Further, being a Euclidean
distance, it is easy to show that the Wasserstein distance satisfies the König-Huygens
theorem for the decomposition of inertia. This allows us to apply the usual indices
based on the comparison of the between-group and within-group inertia for the
evaluation and the interpretation of the results of a clustering or of a classification
procedure.
Table 2. Clusters obtained using different allocation functions. Last row: Corrected Rand
Index (CRI) of the obtained partition compared with the expert partition

Cluster  L2 Wasserstein                   Adaptive L2                      Hausdorff distance
1        Bahrain Bombay Cairo Calcutta    Bahrain Bombay Calcutta Colombo  Bahrain Dubai HongKong
         Colombo Dubai HongKong           Dubai HongKong KulaLumpur        NewDelhi Cairo MexicoCity
         KulaLumpur Madras Manila         Madras Manila NewDelhi           Nairobi
         NewDelhi
2        Amsterdam Copenhagen Frankfurt   Amsterdam Copenhagen Frankfurt   Amsterdam Copenhagen Frankfurt
         Geneva London Moscow Munich      Geneva London Moscow Munich      Geneva London Moscow Munich
         Paris Stockholm Toronto Vienna   Paris Stockholm Toronto Vienna   Paris Stockholm Toronto Vienna
         Zurich                                                            Zurich
3        Mauritius MexicoCity Nairobi     Cairo Mauritius MexicoCity       Bombay Calcutta Colombo
         Sydney                           Nairobi Sydney                   KulaLumpur Madras Manila
                                                                           Mauritius Sydney
4        Athens Lisbon Madrid New York    Athens Lisbon Madrid New York    Athens Lisbon Madrid New York
         Rome SanFrancisco Seoul Tehran   Rome SanFrancisco Seoul Tehran   Rome SanFrancisco Seoul Tehran
         Tokyo                            Tokyo Zurich                     Tokyo
CRI      0.53                             0.49                             0.46


On the other hand, further work is required on the extension of the distance to
the multivariate case. Indeed, here we only proposed an extension (in the sense of
Minkowski) of the distance under the hypothesis of independence between the
descriptors of a multidimensional interval datum.